


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1995-12 


Eureka: a distributed shared memory system 
based on the Lazy Data Merging consistency model 


Tavares, Joao Alberto Vianna 


Monterey, California. Naval Postgraduate School 
http://ndl.handle.net/10945/35209 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for 
(8 DUDLEY research materials and institutional publications created by the NPS community. 
«ist sae Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS'‘s first 


INN KNOX appointed — and published -- scholarly author. 

| LIBRARY Dudley Knox Library / Naval Postgraduate School 

411 Dyer Road / 1 University Circle 
Monterey, California USA 93943 





http://www.nps.edu/library 








NAVAL POSTGRADUATE SCHOOL 
Monterey, California 





THESIS 


EUREKA: A DISTRIBUTED SHARED MEMORY SYSTEM 
BASED ON THE LAZY DATA MERGING 
CONSISTENCY MODEL 


by 


Joao Alberto Vianna Tavares 


September 1995 


Thesis Advisor: 





Approved for public release; distribution is unlimited. 


oetzig Omg = 











REPORT DOCUMENTATION PAGE Ge oe 


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time reviewing instructions, searching existing data sources 
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this 
collection of information, including suggestions for reducing this burden to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson 
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503. 


1. AGENCY USE ONLY (Leave Blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED 
September 1995 Master’s Thesis 


4. TITLE AND SUBTITLE 5. FUNDING NUMBERS 
Eureka: a Distributed Shared Memory System Based on the Lazy Data 


Merging Consistency Model (U) 























6. AUTHOR(S) 
Tavares, Joao Alberto Vianna 







7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 
Naval Postgraduate School 


Monterey, CA 93943-5000 






8. PERFORMING ORGANIZATION 
REPORT NUMBER 








10. SPONSORING/ MONITORING 
AGENCY REPORT NUMBER 









11. SUPPLEMENTARYNOTES ~ 
The views expressed in this thesis are those of the author and do not reflect the official policy or position 


of the Department of Defense or the United States Government. 













12a. DISTRIBUTION /AVAILABILITY STATEMENT = a 
Approved for public release; distribution is unlimited. 





12b. DISTRIBUTION CODE 










13. ABSTRACT (Maximum 200 words) 
Distributed Shared Memory (DSM) provides an abstraction of shared memory on a network of workstations. 
Problems with existing DSM systems are lack of portability due to compiler and/or operating system modification 
requirements, and reduced performance due to significant synchronization and communication costs when 
compared to their message passing counterparts (e.g., PVM and MPI). 

Our approach was to introduce anew DSM consistency model, Lazy Data Merging (LDM), which extends 
Data Merging (DM). LDM is optimized for software runtime implementations and differs from DM by “lazily” 
placing data updates across the communication network only when they are required. It is our belief that LDM can 
significantly reduce communication costs, particularly for applications that make extensive use of locks. 

We have completed the design of “Eureka” , aprototype DSM system that provides a software implementation 
of the LDM consistency model. To ensure portability and efficiency we use only standard Unix™ system calls and 
a publicly available software thread package, Cthreads, from the University of Utah. Futhermore, we have 
implemented and tested some of Eureka’s core components, specifically, the set of communication and hybrid 
(Invalidate/Update) coherence primitives, which are essential for follow on work in building the complete DSM 
system. The question of efficiency is still an open problem, because we did not compare Eureka with other DSM 
implementations. 




















. SUBJECT TERMS 
Distributed shared memory, lazy data merging, survey on consistency models 


and network programming. 


15. NUMBER OF PAGES 
137 














17. SECURITY CLASSIFICATION 





; 18. SECURITY CLASSIFICATION 
OF REPORT OF THIS PAGE OF ABSTRACT 


Unclassified Unclassified Unclassified 


NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) 
1 Prescribed by ANSI Std. 239-18 


19. SECURITY CLASSIFICATION 
















Approved for public release; distribution is unlimited 


EUREKA: A DISTRIBUTED SHARED MEMORY SYSTEM 
BASED ON THE LAZY DATA MERGING 
CONSISTENCY MODEL 


Joao Alberto Vianna Tavares 
Lieutenant, Brazilian Navy 
B.S., Brazilian Naval Academy, 1983 


Submitted in partial fulfillment of the 
requirements for the degree of 


MASTER OF SCIENCE IN COMPUTER SCIENCE 
from the 
NAVAL POSTGRADUATE SCHOOL 


September 1995 


Author: 


Joao JAlberto Vianna Tavares 





Approved by: 


Amr Zaky, Thesis Advisor 





an-Tak Shing, SeCond Reader 


Ted Lewis, Chairman, 
Department of Computer Science 








1V 











ABSTRACT 


Distributed Shared Memory (DSM) provides an abstraction of shared memory on a 
network of workstations. Problems with existing DSM systems are lack of portability due 
to compiler and/or operating system modification requirements, and reduced performance 
due to significant synchronization and communication costs when compared to their 
message passing Counterparts (e.g., PVM and MPI). 

Our approach was to introduce a new DSM consistency model, Lazy Data Merging 
(LDM), which extends Data Merging (DM). LDM is optimized for software runtime 
implementations and differs from DM by “lazily” placing data updates across the 
communication network only when they are required. It is our belief that LDM can 
significantly reduce communication costs, particularly for applications that make extensive 
use of locks. 

We have completed the design of “Eureka” , a prototype DSM system that provides a 
software implementation of the LDM consistency model. To ensure portability and 
efficiency we use only standard Unix™ system calls and a publicly available software 
thread package, Cthreads, from the University of Utah. Futhermore, we have implemented 
and tested some of Eureka’s core components, specifically, the set of communication and 
hybrid (Invalidate/Update) coherence primitives, which are essential for follow on work in 
building the complete DSM system. The question of efficiency is still an open problem, 


because we did not compare Eureka with other DSM implementations. 
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I. INTRODUCTION 


A. BACKGROUND AND MOTIVATION 


Parallel processing is the upcoming alternative for expanding processing power. The 
trend towards this technology is derived from the availability of high performance 
communication networks and the perception that improvements in computer performance 
from hardware innovations are limited as we approach various physical limits. 

There are two widely accepted models for parallel programming: shared memory and 
message passing. The shared memory model is a direct extension of the conventional 
uniprocessor programming model. In this model, processors interact by modifying data 
objects stored in the shared address space. Multiple Instruction stream Multiple Data 
stream (MIMD) shared-address-space computers, often referred to as multiprocessors, are 
examples of parallel architectures that employ such memory model. A major drawback of 
such architectures is that the bandwidth of the interconnection network must be substantial 
to ensure a scalable performance. To reduce the problem, a fast local memory (cache) is 
incorporated. This local memory concept can further be extended to eliminate physically 
shared memory entirely as in Cache Only Memory Architectures (COMA) [ML95}]. 

According to the memory access time to global and local data, multiprocessor 
architectures can be classified as Uniform Memory Access (UMA), when the time taken by 
a processor to access any memory word in the system is identical, or Nonuniform Memory 
Access (NUMA), whenever the time to access a remote memory bank is longer than the 
time to access a local one. 

Distinct from the previous programming model, the “message passing model” does 
not support the single shared memory abstraction. Instead, each processor has its own 
private memory which is invisible to other processors. These types of architectures can be 
referred as No Remote Memory Access (NORMA) architectures [TN95]. In this model, 
processors can communicate only through explicit “message passing” , dispensing the need 


for enforcing cache coherence as is the case for shared memory systems. 


The message passing paradigm is widely used on MIMD distributed memory 
multiprocessors. This type of architecture allows combining inexpensive, off-the-shelf 
processors connected through an interconnection network or through a hierarchical bus or 
ring architecture, resulting in a very scalable design, delivering high performance 


computing. The next section addresses the issues involved on both programming models. 


B. MESSAGE PASSING VERSUS SHARED MEMORY 


There are advantages and disadvantages to message-passing as well as to shared 
memory. “Message-passing” gives programmers and compilers explicit control over the 
choice of data communicated and over the time of transmission as opposed to the shared 
memory paradigm. With appropriate interfaces and protocols, it is relatively easy to 
overlap computation with communication. The explicit nature of message-passing is 
perceived also as its main weakness [CBZ91]; programmers and compilers need to plan and 
to program explicitly every communication action. Such planning is especially difficult for 
applications that use data dependent communication actions. When exact access patterns 
are not known, performance is affected by the volume of data communicated, the number 
of messages sent, and the amount of time that processes must wait for messages to be 
delivered. 

In contrast, Shared Memory can be viewed as a simpler and more intuitive abstraction 
[ACDB94]. A Distributed Shared Memory (DSM) system can be defined as an abstraction 
of a single shared address space, implemented on a distributed memory multiprocessor. 
DSM allows processes to assume a globally shared virtual memory even though they 
execute in nodes that do not physically share memory. The DSM software provides the 
abstraction of a globally shared memory, in which each processor can access any data item, 
without the programmer having to worry about where the data is, or how to move this value 
to the appropriate processor. On a DSM system the programmer can concentrate on 
algorithmic development rather than on managing partitioned data sets and communicating 


values. In addition to ease of programming, DSM provides the same programming 























environment as that on hardware implementations of shared-memory multiprocessors, 
simplifying the portability between the two environments. A common application for the 
DSM concept allows for seamless integration of shared memory workstations in a network 
environment. Figure 1 illustrates a DSM system consisting of N networked workstations, 


each with its own memory, connected by a network. 


Network 


Shared Memory abstraction 





Figure 1: DSM abstraction 


In summary, we can state that the advantages of the DSM model over message passing 


are: 


¢ Shared memory programs are usually shorter and easier to understand than their 
message passing counterparts. 








¢ Large or complex data structures may easily be communicated. 
¢ Shared memory gives transparent process-to-process communication. 
¢ Programming with shared memory is a well-understood problem. 


C. PROBLEM STATEMENT 


DSM _ systems combine features of shared memory and distributed memory 
multiprocessors. They support the relatively simple and portable programming model of 
shared memory on physically distributed memory hardware, which is more scalable and 
less expensive to build than the shared memory hardware. Although a large number of 
DSM systems (Munin [CBZ92], Midway [BZS93] and [ZSB94], etc.) have been proposed 
and implemented they are still not widely in use. Some evident reasons are lack of 
portability, low performance when compared with their message passing counterparts, and 
the need for extensive modifications of existing programs. 

The challenge in building a DSM system is to achieve good performance over a wide 
range of parallel programs without requiring extensive program restructuring by the 
programmer [CBZ92]. The overhead of maintaining consistency in software and the high 
latency of sending messages make this rather difficult. The primary source of DSM 
overhead is the large amount of communication that is required to maintain consistency and 
the operating system cost to prepare a message associated with the network latency. The 
conjunction of these two factors penalizes the overall system performance. 

Data Merging [KS93] provides means for implementing a DSM system with 
communication costs comparable to the previous proposals, but requiring only small 
hardware changes. The goal of this research is to extend such protocol to a portable 
software implementation and verify its feasibility. Since for software implementations, the 
communication costs are much higher than on a shared memory multiprocessor, we impose 


some restrictions on the Data Merging protocol as originally proposed. 














D. CONTRIBUTION 


The major contribution of this thesis is to extend the Data Merging as a viable 
consistency protocol for implementing DSM. In order to achieve performance results 
comparable to previous DSM implementations and to provide a easily portable solution 
without requiring changes to the application programs, we introduced some modifications 


to the original protocol as proposed by Karp and Sarkar in [KS93]. 


EK. THESIS OVER VIEW 


The remainder of this thesis is organized as follows. Chapter II gives a comprehensive 
overview of DSM, starting by describing a taxonomy for classification of DSM systems, 
followed by a broad discussion on the issues involving such systems and the existing 
memory consistency model proposals. In Chapter III we briefly describe representative 
DSM implementations of each category. In Chapter IV we analyze the Data Merging and 
Lazy Data Merging protocols pointing out their mean features. Chapter V outlines the 
design decisions and the modifications introduced for a portable software implementation 
and we relate the implementation details applied to the Eureka DSM system. In Chapter 


VI we conclude our work and provide directions for further research. 





Ii. BACKGROUND 


This chapter gives a comprehensive overview of Distributed Shared Memory 
(DSM) systems principles. We start by describing two taxonomies for classifying DSM 
systems. Further on, we enumerate the major issues that are involved in DSM system 


implementations and give a description of existing memory consistency models. 


A. TAXONOMIES FOR CLASSIFYING DISTRIBUTED SHARED MEMORY 
SYSTEMS 


Shared memory systems cover a broad spectrum, from systems that maintain 





consistency entirely in hardware, to those that do it entirely in software. To present such 
wide spectrum we need some sort of criteria for their classification. We introduce two 
taxonomies for classifying such systems. The first was informally introduced by 


Tanembaum [TN95]. Figure 2 describes this taxonomy as originally proposed. 
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Figure 2: Taxonomy for classifying DSM systems as proposed by Tanenbaum in 
[TN95]. 





The other taxonomy was proposed by Milutinovic [ML95] and has the merit of 
being one of the pioneer attempts for establishing a formal criteria for classifying DSM 
systems. The proposed taxonomy adopts two variables for categorizing DSM systems: 
DSM implementation level (Hardware, Software or Hybrid) and the DSM algorithm 
(SRSW, MRSW, MRMW). The following paragraphs will briefly introduce this taxonomy 


and some of the issues involved in its application. 


I. DSM Implementation Level 


There are three basic types for classifying systems under this criterion: Hardware, 
Software and Hybrid DSM systems. Figure 3 summarizes the three implementation levels 
in which current DSM systems fall. 

The implementation level affects both the programming model as well as the 
overall system performance. While hardware solutions bring transparency and small access 
latencies, software solutions can better exploit the application behavior and present more 


flexibility, especially as a source for experiments of new concepts and algorithms. 


a. Hardware. 


b. Software. 
(1) Operating System. 


e Inside the kernel. 
e Outside the kernel. 


(2) Runtime Library Routines. 


(3) Compiler Inserted Primitives. 


c. Hardware/software combination. 





Figure 3: DSM implementation levels. 


1. As described in [ML95]. 

















a. Hardware Level DSM 


Most of the hardware DSM systems concentrate in one of three categories: 
¢ Cache Coherent Non-Uniform Memory Architecture (CC-NUMA). 

¢ Cache-Only Memory Architecture (COMA). 

¢ Reflective Memory System Architecture (RMS). 

On CC-NUMA architectures the shared virtual address space is statically 
_ distributed across the clusters. It is accessible by the local processors and by processors 
from other clusters with distinct access latencies. In general, the DSM mechanism relies on 
directories with organizations varying from a full-map storage (DASH) to dynamic 
Structures (1.e., linked lists, fat trees, etc.). 

COMA architectures provide the dynamic partitioning of data in the form of 
distributed memories organized as large second level caches. Distinct from the previous 
architectures, there is no physical home location for any particular data item. These 
architectures allow a particular data item to be simultaneously replicated on many caches. 
The typical architecture consists of hierarchical network topology (i.e., KSR-1). 

RMS architectures adopt a hardware-implemented update mechanism. This 
is achieved by declaring some parts of the local memory on each cluster as shared and 
mapped into a common virtual address space. Coherence maintenance is enforced by full 
replication. This is achieved by broadcasting/multicasting every write operation to the 
other units. The result is a high cost for write operations. Typical examples of these 


architectures are Scrannet and Encore’s RMS. 


b. Software Level DSM 


As described in Figure 3 software level implementations can be divided 
into three basic categories: compiler implementations, user-level runtime packages and 
operating system level implementations. The latter can further be subdivided into inside/ 


outside the kernel. 








Operating system (inside the kernel) implementations are incorporated to 
the actual operating system kernel. The advantage of this approach is that the semantics of 
the underlying operating system can be preserved. An example of such system is Mirage 
[ML95]. 

Operating system (outside the kernel) implementations are those on which 
the same mechanism can be accessed by both the user and the kernel. An example of this 
systems can be found in Clouds. 

User level runtime packages consists of libraries that are linked to the actual 
application programs. Examples of such systems are Munin, Treadmarks, Quarks, 
Midway, etc. 

For compiler implementations the shared memory paradigm is applied at the 
language level. For these applications, shared data is structured into logical units of sharing 
and accesses to these shared elements are automatically converted into synchronization and 


coherence primitives. 


c. Hybrid implementations 


Hardware DSM implementations have the disadvantage of limited 
flexibility, especially for enforcing multiple coherence protocols. They also present some 
limitations on scalability due to the requirements for maintaining directories, for example. 
On the other hand, software implementations have lack on performance due to the larger 
granularity and the latency for exchanging messages. 

Hybrid implementations try to combine both levels to resolve some of these 
problems. Typical examples for these architectures are the MIT Alewife in which the full- 
map directory is implemented part in hardware and part in software and the Stanford 


FLASH, which provides the means for implementing multiple coherence protocols. 


d. Factors that affect the DSM implementation level 


For better understanding we divide this item in two topics: 


¢ System architectural configuration; and 
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¢ Shared data organization. 

System architectural configuration affects the system performance, since it 
can offer or restrict a good potential for parallel processing of requests related to the DSM 
management. It also strongly affects the scalability. Since a system applying a DSM 
mechanism is usually organized as a set of clusters connected by an interconnection 
network, architectural parameters include: 


¢ Cluster configuration (single/multiple processors, with/without, shared/ 
private, single/multiple level caches, etc.); 


¢ Interconnection network (bus hierarchy, ring, mesh, hypercube, specific 
LAN, etc.). Almost all types of interconnection networks found in 
multiprocessors and distributed systems have also been used in DSM 
implementations. The software-oriented systems are, in general, built on 
top of Ethernet (Munin, Treadmarks, Eureka) or ATM (Midway), while 
topologies such as bus hierarchies (DASH, FLASH, Alewife), meshes or 
rings (Memnet) are typical for hardware or hybrid solutions. 


Shared Data organization represents the global layout of shared address 
space, as well as the size and organization of data items in it, and can be distinguished as: 


¢ Structure of shared data (e.g., non structured or structured into objects, 
languages types, etc.); 


¢ Granularity of coherence unit (e.g., word, cache, block, page, complex 
data structure, etc.). 


Hardware solutions generally deal with non-structured data objects 
(typically cache blocks), while many software implementations tend to use data items that 
represent logical entities in order to take advantage of the locality naturally expressed by 
the application. On the other hand, some software solutions, based on virtual memory 
mechanisms, organize data in larger physical blocks (pages), counting on the coarse-grain 


sharing. 
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2. DSM Protocols 


This classification deals with the possible existence of multiple copies of a data 
item. It also considers access rights to these copies. The complexity of maintaining 
coherence among different copies of a data item varies strongly with the algorithm. Many 
policies have been proposed, the majority of them adopting “multiple readers single 


writer” (MRSW) algorithms. Figure 4 depicts this classification criteria. 


a. SRSW (Single Reader/ Single Writer) 
e Without Migration. 
e With Migration. 


b. MRSW (Multiple Readers / Single Writer). 
c. MRMW (Multiple Readers / Multiple Writers). 





Figure 4: The second criterion: the DSM algorithm.” 


Basically, there are three parameters closely related to the algorithm: 


¢ Responsibility for the DSM management (e.g., centralized, distributed/fixed, 
distributed/dynamic). 


¢ Consistency Model (e.g., strict, sequential, causal, weak, release, etc.). 

¢ Coherence policy (e.g., write-invalidate, write-update, type-specific, etc.). 

The responsibility for the DSM management can be either centralized or 
distributed. Centralized management is easier to implement, but suffers from the lack of 
fault tolerance and can become a performance bottleneck. On the other hand, distributed 
management policy can be defined either statically (fixed) or dynamically, eliminating 
bottlenecks and providing scalability. In the case of a static management each manager is 


assigned a predetermined subset of the data space, which remains fixed throughout the 


2. The classification of Figure 4 is equivalent to the PRAM classification as defined in [IKGGK94]. 
The equivalences are SRSW and EREW, MRSW and CREW and MRMW and CRCW. 
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lifetime of an application. In contrast, the dynamic approach? management responsibility 
shifts from a node to another at runtime. 

The consistency model defines and enforces acceptable ordering of accesses to 
shared data by different processes such that at pre-specified points of time the state of 
shared data (as viewed by each individual process) is “correct” in some process-defined 
sense. Also, in [AH90] a consistency model is described as a contract between the software 
and the hardware in which, by this contract, the software agrees to some formally specified 
constraints, and the hardware agrees to appear consistent to at least the software that obeys 
those constraints [TN95]. | 

Stricter forms of memory consistency typically increase the memory access latency 
and the bandwidth requirements. More relaxed models result in better performance at the 
expense of a higher involvement of the programmer in synchronizing accesses to shared 
data [ML95]. 

The memory coherence protocol determines when and how all the existing copies 
of the data items existing at one site will be updated or invalidated on the other sites. © 

It is important now to observe the difference among memory coherence and 
memory consistency. Memory coherence examines in isolation each memory location and 
the sequence of operations on it, without regard to other locations. Memory consistency 


deals with writes to different locations and their ordering [TN95]. 


B. ISSUES ON DISTRIBUTED SHARED MEMORY SYSTEMS 


1. Granularity of Sharing 


The issue of granularity of sharing can be better addressed by answering the 
question: “How large should the shared data block be?”. Determining the right granularity 


largely depends on the problem domain and there is no general solution. Each extreme has 


3. Also known as adaptive partitioning. 
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its advantages. The following table gives examples of some granularity choices for existing 
systems. 

It can be observed that hardware or hybrid DSM systems adopt a much smaller 
granularity. Operating system implementations are, in most of the cases, restricted to use 
the virtual block size (or multiples thereof) as its unit of reference, while software runtime 
library DSM systems have the choice of using compiler support (e.g., Midway) or to 
explicitly declare shared variables (e.g., Munin, Treadmarks, Quarks, etc.) to avoid the 
burden of being forced to explicitly use pages as the unit of coherence. In the following 


paragraphs we will analyze some of the issues that involve different block sizes. 


Table 1: Block Granularity on DSM Systems 


Object Pe | 32 bytes 


FLASH 128 bytes 





For systems which use fixed block sizes (page-based systems and hardware 
implementations) one would like to keep the communication cost as low as possible. There 
are two ways to achieve this goal: using a faster transmission medium or reducing the block 


size as the equation below suggests. 


[latency per byte = fixed message Startup cost + block size 


Transmission Media Bandwidth 


Clearly, there are advantages and disadvantages in choosing a coarser block size for 


a DSM system. The biggest advantage of coarse granularity is to reduce the start-up penalty 
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by amortizing it on a larger number of data bytes. This property is especially important 
because many programs exhibit “locality of reference” [TN95], resulting on a implicit 
prefetching of data that could be accessed in the near future. 

Finer granularity, on the other hand, would be preferred in programs that present a 
high degree of sharing. For this type of application larger block sizes will only diminish the 
opportunity of concurrent access to different parts of a shared data block. 

In summary, larger blocks are ideal for applications which exhibit low degree of 
sharing and good locality of reference when compared to the computational granularity, 
since it minimizes the fixed cost per word transferred. Meanwhile, if the degree of sharing 
is relatively high when compared to the computational granularity then a smaller block size 
becomes more attractive. The next section addresses this issue when a high degree of 


sharing is present. 


2. False Sharing 


False sharing arises because a DSM system cannot identify updates to individual 
bytes when protecting regions of memory, while the memory hardware provides control 
only at the granularity of a data block. Therefore, false sharing occurs when two or more 
processes update distinct portions of the same data block. 

False sharing poses a problem for systems that maintain consistency at the 
granularity of entire pages or entire objects: every time a thread modifies a page of a shared 
object, these systems must invalidate or update all copies. It is a particularly serious 
problem for two reasons: | 

¢ The consistency units are large, so false sharing is very common; and 


¢ the latencies associated with detecting modifications and communicating are 
relatively big, resulting in unnecessary faults and messages that are particularly 
expensive. 


There are two extremes of applications addressing false sharing. 


¢ The problem is ignored (i.e., [VY [KL88]): the consequences are that pages will 
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“ping-pong” back and forward between processors as can be seen from Figure 5. 


¢ We allow multiple writers to the same data block and rely on the programmer 
to ensure that no two processors are writing to the same memory location. This 
approach is known as Multiple Writers and is used by many existing 
implementations (Munin, Treadmarks, and CarlOS). 


Processor P1 Processor P2 


Y =4; 
end loop; 


Data Block 


Y=4 
Figure (a) - Invalidate Sequence Figure (b) - Update Sequence 


Before performing a write the _ Every write to a shared data 
process invalidates the remote block will result on an update 
copies. The consequence is that message to all other processors. 
data blocks will ping-pong 

between P1 and P2. 





Figure 5: False Sharing. 


The problem of false sharing is more extensively addressed in [BS93] in which an 
attempt for quantifying the problem is given. It is interesting to mention that the cost of 
False Sharing is small for programs that present a small degree of sharing while the cost 
becomes prohibitive whenever a high degree of sharing is present. At first glance the naive 


solution would be to reduce the coherence unit size. This would reduce or even eliminate 
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the problem. However, for applications in which the data is migratory by nature if this 
reduction is too large, exactly the opposite happens; the cost gets larger with a smaller 
block size due to the increase on the number of operations required [BS93]. 

In summary, the right size for the coherence unit to avoid false sharing without 
imposing an increase on the number of operations is highly dependent on the degree of 


sharing of the application type. 


3. Synchronization 


Most parallel applications running on a Shared Memory Multiprocessor rely on a 
set of synchronization operations to enforce mutual exclusion and avoid race conditions. 
For a multiprocessor environment Test-and-Set operations have a reduced cost and are 
widely used to implement atomic transactions. On the other hand, for a software DSM 
system this approach is unacceptable [MU94] due to limitations on network bandwidth. 

Generally, DSM systems rely on explicit synchronization mechanisms to enforce 
consistency on shared data. One alternative implementation is to provide Synchronization 
Manager(s) which will handle the allocation/dealocation of synchronization objects and the 
corresponding operations (e.g., Acquire/Release of locks). This approach reduces network 
traffic at the expense of centralized control per synchronization object and is commonly 
named as “centralized locking’”’. 

Shared-variable systems like Munin and Midway rely on “distributed locking” 
schemes. More precisely, Munin provides a directory for synchronization variables in 
which each lock is mapped. On a lock request if the lock is Local (owned by the local 
processor) it is released to the local thread, otherwise its owner is located through 


consulting a Synchronization Directory. 


4. Heterogeneity 


This is an issue that presents no easy solution. Sharing data between two machines 
with different architectures, and assuming that these two machines may not even use the 


same representation for basic data types would seen very difficult [BL91]. 
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Some solutions were presented for this problem. In Mermaid [ZH91], memory is 
shared in pages and each page can contain only one type of data. Whenever a page is moved 
between two architecturally different systems, a conversion routine converts the data in the 
page to the appropriate format. An alternative proposal is mentioned in [BL91]. It consists 
of organizing the shared data as variables or shared objects in the source language and 
relying on a DSM compiler to add conversion routines to all accesses to shared memory. 
An example of this approach is found in the implementation of Agora. In this system 
memory is structured as objects shared among heterogeneous machines. 

Albeit the solution of the heterogeneity problem allows the addition of more nodes 
to DSM systems, it presents a drawback of requiring data conversions on every transaction 
among heterogeneous platforms. In general, this overhead out-performs the benefits 


[NL91]. 


5. Coherence Protocol 


The choice of the coherence protocol is related to the granularity of shared data. For 
very fine grain data items, the cost of an update message is approximately the same as the 
cost of an invalidation message. Therefore, the update protocol is typical for systems with 
word-based coherence maintenance and invalidation is used in coarse grain systems. The 
efficiency of an invalidation approach is increased when the sequences of reads and writes 
to the same data item by various processors are not highly interleaved [ML95]. 

In spite of some drawbacks, update protocols are promising in one respect: the 
number of messages involved. It directly reflects the message passing nature of the 
underlying system. Updates can be thought of as sending a message containing the state 
that the application wishes to share among the different parts of the program. Hence, we 
can expect that when used carefully, the update protocol to perform as well as any message 
passing implementation of an application [AAL92]. The drawback of this approach is that 


updates may be sent to nodes that are not on demand to the updated value. To avoid this 
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problem and, consequently, reduce the number of update messages, two new consistency 
models were proposed: Lazy Release Consistency and Entry Consistency. 

In contrast, the invalidate protocol involves two extra messages to achieve the same 
effect: the invalidate message to a node caching a given page and the get message for the 
same page on a subsequent access by the node. 

There are also some proposals for a hybrid solution as can be seen in [DKCZ93] in 
which a hybrid coherence protocol is proposed for Lazy Release Consistency. 

A fourth alternative is proposed for the Clouds operating system [MU94]. This 
approach uses direct association of locks to govern the access to shared cache blocks, 
allowing data associated with the lock to be sent to the requester along with the lock 
granting. Upon a release of a lock, the associated data is sent back (if modified) to global 


memory. 


C. DSM MEMORY CONSISTENCY MODELS 


In this section we introduce the more well known consistency models and 
enumerate some of their strengths and weaknesses. It is important to observe that the 
models are listed in increasing order of flexibility. 

_ Before we proceed, we need to define what it means to perform a memory request 
and a memory load. The following formal definitions are extracted from [GLLG90]. In both 
definitions P; refers to processor 7: 

Definition 1: Performing a Memory Request 

A LOAD by P; is considered performed with respect to P, at a point in time when 
the issuing of a STORE to the same address by P, cannot affect the value returned by the 
LOAD. A STORE by P; is considered performed with respect to P, at a point in time when 
an issued LOAD to the same address by P, returns the value defined by this STORE (or a 
subsequent STORE to the same location). An ACCESS (LOAD/STORE) by P; is performed 
when it is performed with respect to all processors. 


Definition 2: Performing a LOAD Globally. 
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A LOAD is globally performed if it is performed and if the STORE that is the source 
of the returned value has been performed. 

Definition 3: Performing a STORE Globally. 

A STORE is globally performed if it is performed and if all immediate subsequent 
LOADS from the corresponding memory location return the value issued by the STORE. 

After the three above definitions we are ready to describe some of the existing 


memory consistency models. 


1. Strict Consistency Model 

This is the most stringent consistency model. It is defined by the following 
condition: 

“Any read to a memory location x returns the value stored by the most 
recent write operation to x.” [TN95] 

In other words, when memory is strictly consistent, all writes are instantaneously 
visible to all processes and an absolute global time order is maintained. If a memory 
location is changed, all subsequent reads from that location will see the new value, no 
matter how soon after the change the reads are done and no matter which processes are 
doing the reading and where they are located. This type of memory consistency is easily 
achieved on a uniprocessor system, but it is almost impossible to guarantee on a 


multiprocessor environment, without explicity synchronizing on all STORE operations. 


2. Sequential Consistency Model 


The sequential consistency is defined by Lamport [LAM79] as follows: 

“A system is sequentially consistent if the result of any execution is the 
same as if the operations of all the processors were executed in some sequential order, and 
the operations of individual processor appear in this sequence in the order specified by its 
program...” 

The following are the sufficient conditions for providing sequential consistency: 


* Before a LOAD is allowed to perform with respect to any other processor, all 
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previous LOAD accesses must be globally performed and all previous STORE 
accesses must be performed; and 


¢ Before a STORE is allowed to perform with respect to any other processor, all 
previous LOAD accesses must be globally performed and all previous STORE 
accesses must be performed. 


Figure 6 exemplifies a program that runs concurrently on two distinct processors. 


Processor 1 Processor 2 


lx=1; 2.y=1; 
3. if (y == 0) 4. if (x ==0) 


5S. kill P2; 6. kill Pl; 


Valid outcomes under a sequentially consistent program would be: 


Processor P1 or Processor P2 being killed, none of them being killed, 
but never both processors. 





Figure 6: An example of a sequentially consistent program. 





In summary, Sequential Consistency requires that the distributed memories in a 
DSM have the same consistency properties as a time shared uniprocessor, which requires 
that the global state of memory be consistent after every read or write to shared memory. 


This requirement imposes severe restrictions on possible performance optimizations. 


Rel 
Release 
| lock 


Figure 7: Communications on the Sequential Consistency Model. 4 





4. Dotted lines represent the intervals in which the processor should stall. 
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The reason why sequential consistency is inhenritly inefficient can be observed in 
Figure 7. Every write operation forces the system to stall until the corresponding data block 
is either invalidated or the updates are propagated to the other processors. More formally, 
every STORE operation stalls the processor until it is performed. 

Therefore, when both processors P1 and P2, in Figure 7, have cached the same 
copies of the variables X and Y within a critical section, each write must be delayed until 
the previous write completes even within a critical section. This will, besides requiring a 
large number of messages, have a large delay due to the periods processor P1 must stall 
(represented by dotted lines on Figure 7) while communicating. The use of sequential 


consistency still requires synchronization when preemptive scheduling is used. 


3. Processor Consistency 


The concept of Processor Consistency was introduced by Goodman [GVW89]. It 
requires that writes issued from a processor may not be observed by other processors in any 
order other than the one in which they were issued. Specifically, this model relies on the 
use of explicit synchronization to guarantee strict event ordering. The following conditions 
are necessary for processor consistency: 


¢ Before a LOAD is allowed to perform with respect to any other processor, all 
previous LOAD accesses must be performed; and 


¢ Before a STORE is allowed to perform with respect to any other processor, all 
previous accesses (LOADS and STORES) must be performed. 


The above conditions allow reads following a write to bypass the write. To avoid 
deadlock, the implementation should guarantee that a write that appears previously in 
program order will eventually perform [GLLG90]. Here we need to demonstrate the subtle 
difference between Processor Consistency and Sequential Consistency. We will use the 
example of Figure 8 to illustrate that. 

According to Ahmad [AHJ90] for a sequentially consistent program outcome to be 
legal it must obey two constraints: | 


¢ Program order must be maintained; and 
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¢ Memory coherence must be respected. 


Processor Il Processor 2 


i e—s bh 2.y=1; 
3. if (y == 0) 4. if (x == 0) 
5. kill P2; 6. kill Pl; 


A valid sequence for a processor consistent program would result on both processors 
P1 and P2 being killed. This outcome is possible since processor consistency allows 
reads to bypass writes. So y == 0 on P1 and x == 0 are tmie statements. 





Figure 8: Example of a processor consistent program. 


Processor Consistency, in contrast, is more relaxed since it only requires that writes 
issued from a processor may not be observed in any order other than that in which they were 
issued. As can be observed, Processor Consistency may not issue the correct result if the 
programmer is expecting sequential consistency, thus requiring the use of explicit 


synchronization by the programmer to enforce sequential consistency. 


4. Causal Consistency Model 

Causal Consistency can be defined as: 

“An execution i causal memory is correct if the value returned by each read 
operation in the execution belongs to a set of correct values for that location.” 

More precisely, writes that are potentially causally related must be seen by all 
processes in the same order. Concurrent writes may be seen in a different order on different 
machines [TN95]. For causally related events we mean that an event A is caused by or 
influenced by event B according to the definition of event ordering from Lamport 
[LAM79]. Concurrent events are those events that are not causally related. Figure 9 
describes a sequence of events that are Causally consistent but violate sequential 


consistency. In this figure the symbol W (x) a represents a STORE operation where the 
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value “a” is stored on the variable “x”. By R (x) a we mean that a LOAD of variable “x” 


has the returned the value “‘a”’, 


W (x) 1 W (y)2 R (z) 0 


€14 


R (x) 1 





Figure 9: Causal Consistency Model. 


As can be observed, the “R (x)” operation on processor P2 has returned the value 


“0” . This represents that both events “e,;” and “e,” are concurrent, while events “e,3” is 
causally related to event “e)>”. 


Causal Consistency was implemented on the Clouds Distributed Operating System. 
This implementation uses a vector timestamp [LAM79] to capture the evolving causal 
relationships. This implementation of Causally Consistent memory will use invalidates to 
resolve inconsistencies and, although more relaxed than Sequential Consistency, it does not 


resolve the problem of false sharing. 


5. Weak Consistency Model 


A consistency model can be derived by relating memory requests ordering to 
synchronization points in the program [GLLG90]. The delays imposed by the sequential 
consistency model are unnecessary if appropriate synchronization mechanisms are 
enforced. Given that all synchronization points are identified, we need only to ensure that 


memory is consistent at those synchronization points. This scheme has the advantage of 
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permitting multiple memory accesses to be pipelined [AH90]. Sarita and Hill define Weak 
Ordering as follows: 
“Hardware is weakly ordered with respect to a synchronization model if and only if 
it appears sequentially consistent to all software that obey the synchronization model.” 
The synchronization model defined in [AH90] is named Data-Race-Free-0 (DRFO) 
and is closely related to the happens-before relation [LAM79] which can be defined as the 


ireflexive transitive closure of program order and synchronization order: 


hb. = ( DO ss UF SOs \* 


where po represents program order and so, synchronization order, respectively. 

The complete definition for the synchronization model DRFO is given below: 

A program obeys the synchronization model DRFO if and only if: (1) all 
synchronization operations are recognizable by the hardware and each accesses exactly one 
memory location, and (2) for any execution of the idealized system all conflicting accesses 
are ordered by the happens-before relation corresponding to the execution. In this 
definition, two accesses are said to conflict if they access the same location and they are not 
both reads. The figure below describes an example of an execution that obeys the DRFO 


model. 





Figure 10: An example of DRFO [AH90]. 
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In [GLLG90] a slightly different set of conditions is listed, which for consistency 
reasons we will adopt for the remainder of this document: 


¢ Before an ordinary LOAD or STORE access is allowed to perform with respect 
to any other processor, all previous synchronization accesses must be performed; 


¢ Before a synchronization access is allowed to perform with respect to any other 
processor, all previous ordinary LOAD and STORE accesses must be performed; 
and 


¢ Synchronization accesses are sequentially consistent with respect to one 
another. 


The first and second rules assure that the instructions inside the critical section stay 
inside and those outside stay outside as observed by any other processor. The third rule 
assures that the synchronization variables can create critical sections. The example below 
depicts a correct and an incorrect outcome for a Weakly Ordered program. 

As can be seen from Figure 1 la there are no guarantees for the outcome before the 
synchronization point. In contrast, after the synchronization is performed the local memory 
should be brought up to date returning the most recently values written to it. Therefore, on 
Figure 11b it can be noted that the value returned by processor P2 is invalid. 

In summary, Weak Consistency assures the correctness of parallel programs by 
placing tight restrictions on synchronizing instructions and loose restrictions on ordinary 


instructions. 


Pl: x=1,x =2 Synch Pl: x= 1,x =2 Synch 
—$—— $$$ $—_______» 


P2; Synch a= x. 


a=x,b=x Synch : Synch a= x. 


nn ae ne at 


Outcomes: Outcomes: 
P2 ->a=1,b=2 => CORRECT. P2 ->a=1=> INVALID 
P3 ->a=2,b= 1=> CORRECT. P3 -> a= 2 => CORRECT 


(a) (b) 





Figure 11: Weak Consistency valid (a) and invalid (b) sequences of events. 
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6. Release Consistency Model 


Release Consistency is an extension of weak consistency that exploits the 
information about acquire, release, and non-synchronization accesses. To better describe 
the designation below we must describe the notions of competing and conflicting accesses. 
Two accesses by one or more processors are conflicting if they are to the same memory 
location and at least one of the accesses is a STORE. If a pair of conflicting accesses execute 
simultaneously, causing a race condition, then such accesses form a competing pair. If an 
access is involved in a competing pair, then the access is considered a competing access 
[GLLG90]. 

The following gives the conditions for ensuring release consistency: 


¢ Before an ordinary LOAD and STORE access is allowed to perform with respect 
to any other processor, all previous acquire accesses must be performed; 


¢ Before a release access is allowed to perform with respect to any other 


processor, all previous ordinary LOAD and STORE accesses must be performed; 
and 


¢ Special accesses (acquire and release) are processor consistent with respect to 
one another. 


In the above definition ordinary accesses are represented by non competing 
accesses and special accesses denote the competing ones. 

Therefore, release consistency relaxes the constraints of sequential consistency in 
three ways: 


¢ Ordinary reads and writes can be buffered and pipelined between 
synchronization points; 


¢ Ordinary reads and writes following a release do not have to be delayed for the 


release to complete (1.e., a release only signals the state of past accesses to shared 
data); and 


¢ An acquire access does not have to delay for previous ordinary reads and writes 
to complete. [CBZ91]. 


When compared with Weak Consistency we can observe that four of the ordering 
restrictions present in Weak Consistency are not present in Release Consistency. First is 


that ordinary LOAD and STORE accesses following a release access do not have to be 
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delayed for the release to complete. Second, an acquire synchronization access need not be 
delayed for previous ordinary LOAD and STORE accesses to be performed. Third, a non- 
synchronization special access does not wait for previous ordinary accesses and does not 
delay future ordinary accesses. The fourth difference lies in the ordering of special 
accesses. For Weak Consistency, the accesses are sequentially consistent while for Release 
Consistency the accesses can be Processor Consistent. 

This model was developed as part of the DASH project and proved effective at 
hiding the effects of memory latency by pipelining invalidation messages caused by writes 
to shared data [CBZ91]. Figure 12 depicts the gains in the communication costs that are 
achievable through this approach. As can be observed, the processor needs only to stall at 
the time of a release operation, STORES are pipelined. Therefore, it introduces a large 


optimization when compared to a sequentially consistent program (Figure 7). 





Figure 12: Release Consistency model. 


Besides the generic model adopted for the DASH implementation there are two 
other variants for Release Consistency: Eager Release Consistency (e. g., Munin) and Lazy 


Release Consistency (e.g., Treadmarks). 


a. Eager Release Consistency 


When a thread performs a release, it stalls until all modifications to shared 
data have been performed (invalidated/updated). This new scheme buffers writes to shared 


data until the subsequent release, at which point it flushes the buffered writes. Ideally, this 
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strategy reduces the number of messages transmitted from one per write to one per critical 


section when there is a single replica of the shared data. 


buffer update messages Ack for (x,y) 


Single Update Msg ———» 





Figure 13; Eager Release Consistency model. 


As we observe in Figure 13, this approach increases the latency of a release 
when compared to the Release Consistency Model. Nevertheless, the reduction in the 
number of messages may outweigh the effect of higher release latencies. Carter [CBZ92] 
also proposes the use of Update instead of an Invalidate-based coherence protocol since the 
above approach only solves the cost of writes, but has no effect on read misses. When the 
ratio of read/write to shared data is relatively high, the effect of read misses can be 
mitigated by using an update-based protocol. This approach is feasible when used in 


combination with the buffered approach as is the case of Munin. 


b. Lazy Release Consistency 


When a thread performs an acquire, all “stale” data is discarded or updated. 
This approach is adopted in the implementation of Treadmarks [ACDB94] and CarlOS 
[KF94]. Compared with Eager Release consistency it causes fewer messages to be 
exchanged. At the time of a lock release, Munin sends messages to all processors which 
cache data modified by the releasing processor. In contrast, in Lazy Release Consistency 


(LRC) messages only travel between the last releaser and the new acquirer. 
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LRC is somewhat more complicated than eager release consistency. After a 
release, Munin can forget about all modifications that the releasing processor made prior to 
the release. This is not the case for LRC, since a third processor may later acquire the lock 
and need to see the modifications. 

More formally, in Lazy Release Consistency the propagation of 
modifications is further postponed until the time of the acquire. At this time, the acquiring 
processor determines which modifications it needs to see according to the definition of 
Release Consistency. To do so, LRC uses a representation of the happened-before relation 
introduced by Adve and Hill [AH90]. Release Consistency requires that before a processor 
may continue past an “acquire”, all shared accesses that precede the acquire according to 
happens-before-1 relation must be performed at the acquiring processor. LRC guarantees 
that this property holds by propagating write-notices on the message that affects a release- 
acquire pair. A write-notice is an indication that a page has been modified in a particular 
interval, but it does not contain the actual modifications. Each new interval begins with 
each special access performed by the corresponding processor. Such intervals are, in turn, 
used inside Vector Clocks [LAM79] to enforce the happened-before relation among the 
processors. 

On an acquire, the acquiring processor, P;, sends its current vector 
timestamp to the previous releaser, P;. Processor P; uses this information to send to P; the 
write-notices for all intervals of all processors that have performed at P; but have not yet 


performed at P;. Releases are pure local operations in LRC and no messages are exchanged. 


For the case of an update coherence protocol the acquiring processor 
updates all pages for which it received write-notices. In contrast, for an invalidate protocol, 
the acquiring process invalidates all pages for which write-notices were received. Figure 


14 gives an example of both update and invalidate protocols. 
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Processor 1 Processor 2 Processor 3 

acquire_lock acquire_lock acquire_lock 
x=5 y= c=y 

release_lock release_lock release_lock 


P1 [x] acq w(x)5 rel 


inv(x) 
a ac rel 


inv (xy) feo 
P3 [y] 
r rel 


a ee | Pa eG 2 ee (om 


(a) Invalidate Protocol 
P1 [x] acq w(x)5 rel 


update (x) 
P2 [y] | 
: acq w(y)6 rel 
‘i update (x,y) 


P3 [yl acq r(y)6 rel 


(b) Update Protocol. 





Figure 14; LRC Invalidate protocol (a) and LRC update protocol (b). 


Figure 15 compares LRC with the generic version of Release Consistency. 
As can be observed, Lazy Release Consistency will have fewer messages, but the 
implementation of such mechanism will be far more complex. The number of messages 
will also be smaller than implementations of Eager Release Consistency (Munin), since for 
both invalidate and updated protocols it will be required that every process that is in the 


copyset receives a release message. 


7. Entry Consistency Model 
Entry Consistency was introduced with the Midway DSM system [BZS93]. For 


entry consistency, data is only consistent on an acquiring synchronization operation, and 


only the data known to be guarded by the acquired object is guaranteed to be consistent at 
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the time of the acquire. Communication between processors occurs only when a processor 


acquires such synchronization objects. 


W(x)Rel Release Consistency 


YN  Y 


Lazy Release Consistency 


Acq W(x) Rel 


Figure 15: Comparison between RC and LRC models. 


W(x)Rel 





Formally, a memory exhibits entry consistency if it meets the following conditions: 


¢ An acquire access of a synchronization variable is not allowed to perform with 
respect to a process until all updates to the guarded shared data have been 
performed with respect to that process; 


¢ Before an exclusive mode access to a synchronization variable by a process is 
allowed to perform with respect to that process, no other process may hold the 
synchronization variable, not even in nonexclusive mode; and 


¢ After an exclusive mode access to a synchronization variable has been 
performed, any other processor’s next nonexclusive mode access to that 
synchronization variable may not be performed until it has performed with 
respect to that variable’s owner. 


The first condition states that when a process does an acquire, the acquire may not 


complete until all the guarded shared variables have been brought up to date. 
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The second condition states that before updating a shared variable, a process must 
enter a critical region in exclusive mode to ensure that no other process is trying to update 
it at the same time. 

The third condition declares that if a process wants to enter a critical region in 
nonexclusive mode, it must first check with the owner of the synchronization variable 
guarding the critical region to fetch the most recent copies of the guarded shared variables. 

Although entry consistency enables the use of low overhead consistency 
mechanisms, writing an entry consistent program requires more work than writing one on 
a more stronger model. For example, every synchronization object must be identified; 
every use of such an object must be explicit; every shared data item must be associated with 
a synchronization object; and synchronization accesses should be qualified as read-only or 
read-write for best performance [BZS93]. 

In summary, entry consistency requires: 

¢ Shared data to be accessed inside critical section; 


¢ All shared data has to be associated with a single synchronization variable (e.g., 
a lock); and 


¢ When a lock is acquired (entry to a critical section), only those variables 
associated with lock are made consistent. 


In the next chapter we describe the main features of existing systems focusing on 


both hardware and software implementations. 
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Ill. DSM SYSTEMS OVERVIEW 


In this chapter we review the main features of some existing DSM implementations. 
We divide the reviewed systems into hardware and software implementations. Our goal is 
to point out the strengths and weaknesses of each individual system. These features will be 


recalled again when our approach is described on Chapters IV and V. 


A. HARDWARE IMPLEMENTATIONS 


I. KSR-1 


The KSR-1 implements the “ALLCACHE” [RO92] memory model. The 
“ALLCACHE” is a hardware message-based distributed virtual memory system which 
enforces the Sequential Consistency Memory Model. Each processor is associated with a 
32 MB cache unit, all of which are tied together by a very fast slotted-ring communications 
mechanism, across which a single address space is defined. The KSR-1 architecture 
exploits locality of reference by organizing a number of “ALLCACHE” Engines in a 
hierarchy. At the lowest level are the ALLCACHE Group:0s (with 32 processors in each 
group) which consists of the ALLCACHE Engine:0s and the local caches associated with 
them. Therefore, an ALLCACHE Engine:0 contains the directory which maps from 
addresses onto the set of local caches within its group. An ALLCACHE Engine: 1 includes 
the directory which maps from addresses into its constituent set of ALLCACHE Group:0s. 
The ALLCACHE Engine is constructed with a “fat-tree” topology so that the bandwidth 
increases at each higher level of the ALLCACHE Engine. 

The distinctive feature in this design becomes apparent when data is required which 
is not located within the local memory and a request is generated: what is returned is not 
simply the data, but the address as well. They are returned because there are two types of 
addresses on the system: System Virtual Address (SVA) and Context Address (CA). The 
SVA is a 64-bit global system-wide reference of any given location (at byte level) in the 


memory system. The CA type are the addresses referenced by each individual task. CA 
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addresses are translated into a SVA by the processor using a translation table managed by 
the ALLCACHE Engine. CA memory is allocated by segments that map into SVA 
segments. Segments in different CA spaces may be mapped to the same segment in SVA 
space, thus allowing sharing between two processors. In these processors each page is a Set 
of 128 subpages of 128 bytes each (total 16 KB). Pages (16 KB) are used as allocation unit 
and subpages (128 bytes) as coherence unit for the system. 

Memory within the KSR-1 is formed within a scalable hierarchy based on the 
average latency to return an address from a given initial location. Physically, there are four 
levels of memory access: 

¢ A 512 KB subcache for each processor. 

¢ A 32 MB cache for each processor cell. 

¢ A 996 MB remote cache, on the same local ring:0. 

¢ A 31744 MB remote cache, on distinct rings (ring:1). 

The communication rings (interconnect) are hierarchically structured, with a first- 
level ring (Ring:0) grouping 32 processors together, and a second-level ring (Ring:1) 


grouping together up to 34 first-level rings. Figure 16 describes this hierarchy. 





Figure 16: The KSR-1 ALLCACHE Hierarchy. 
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A subpage fault (128 bytes) will generate a request which is placed into an open slot 
on the communications ring. The slot is matched against the cached elements associated 
with every processor on the local ring (ring:0). If one of the processors is the owner of the 
data element requested it will put the data plus address on the open slot, otherwise this 
request 1s forwarded through the Ring Routing Cell (RRC) to the next hierarchical level 
(ring:1). 

The coherence protocol enforced by this architecture is write-invalidate and a 
snoop read-broadcast [HN93]. Therefore, at any time there is a unique block owner on the 
system. Whenever a task tries to write to a location, the ownership of that subpage is 
transferred to that processor and an invalidate message is sent to other processors that 
currently cache copies of that data. 

The system provides software instructions that allows for remote data store on other 


processors and also issues prefetch calls retrieving data before it is actually needed. 


2. DASH 


The DASH architecture’s main feature is the introduction of a new consistency 
mechanism “Release Consistency”. As for other hardware implementations in which the 
unit of coherence is small (16 bytes), DASH also adopts a write-invalidate coherence 
protocol. 

The DASH system consists of a two-level, hierarchically organized structure. At 
the top level, the system consists of a set of processing nodes (clusters) connected through 
a pair of wormhole meshes (Figure 17) where each processing node consists of four 
processors linked through a bus-based connection (Figure 18). 

Intra-cluster cache coherence is implemented using a snoopy bus-based protocol, 
while inter-cluster coherence is maintained through a distributed directory-based scheme 


by implementing an invalidation-based coherence scheme. It is the directory responsibility 
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to summarize the information for each memory line, and to specify the clusters that are 


currently storing it [LLJN92]. 





Figure 17: DASH High Level Structure. 


The DSM mechanism for the DASH prototype implements a MRSW Type 
algorithm. Therefore, each memory location can be in one of three states: 
¢ Uncached: not cached by any processing node at all; 


¢ Shared: in an unmodified state in the caches of one or more nodes (Multiple 
Readers); or 


¢ Dirty: in a modified state in the cache of some individual node (Single Writer). 


Processor = 
1st level and Dcache ae ee Directory 
2 i and 


Sights i Intercluster 
oO OE cae - Interface 





Figure 18: DASH Processing Node. 
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The key part of the DSM mechanism is the Distributed Directory which is 
implemented on all clusters. Each memory location has an assigned Home Directory. Data 
ownership can dynamically change whenever a processor performs a write request on the 
same address managed by the Home directory of another cluster. In this situation the 
memory location becomes dirty and the Home Directory invalidates all remote copies 
cached on remote clusters. Note that while data ownership can dynamically change the 
Home Node for any particular block remains fixed. On a memory request, the Home 
Directory takes one of two actions: if the memory location is dirty it forwards the request 
to the current owner, otherwise the requesting block is included on the copyset list and the 
data is forwarded. The problem with the Directory approach is its limited scalability due to 
the use of a bit vector with 1 bit for each cluster. A possible solution would be the use of a 
limited-pointer directory as used on the FLASH implementation [KOHH94]. 

Besides supporting the Release Consistency model, where writes are pipelined, 
DASH uses software-controlled nonbinding prefetching [LLJN92] to hide the network 
latency effect. DASH also provides efficient Fetch-and-Op primitives to reduce the 


synchronization overhead. 


B. SOFTWARE IMPLEMENTATIONS 


1. Operating System Level - Clouds 


The Clouds operating system belongs to a class of object-based distributed 
operating systems and is built on top of a minimal kernel called Ra. The paradigm 
supported by Clouds provides an abstraction of storage called objects and an abstraction of 
execution called threads. All data, programs, devices, and resources are encapsulated in 
objects. Therefore, objects represent the passive entities of the system. Activity is provided 
by threads, which execute within objects. 

At the conceptual level, an object is a virtual address space. In contrast to 
conventional operating systems, objects in Clouds are persistent and are not tied to any 


thread. Since it does not contain a process, it is completely passive. The contents of each 
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virtual address space are protected from outside accesses so that memory in an object is 
accessible only by the code in that object and the operating system. In summary, each 
object is an encapsulated address space with entry points at which threads may commence 
execution [DCMP91]. 

To allow concurrent execution of more than one computation in the same object, 
the system provides a set of shared memory style synchronization primitives. The unit of 
sharing in Clouds DSM is a segment. Associated with each segment is a node called the 
owner where the segment resides on stable storage. The DSM Server object at the owner 
node is responsible for maintaining the consistency of the segment. | 

To unify synchronization with data transfer, Clouds adopts a “lock-based” 
coherence protocol. In this protocol lock requests (both exclusive and shared) result in the 
page associated with the lock being sent to the requester along with the granting of the lock, 
if and only if the requesting mode is compatible with the current mode for the segment, 
otherwise the request is queued. Upon lock release, the associated page is sent back (if 
modified) to the server. Reads or writes to shared data without explicit locking follow 
single copy semantics that do not allow multiple readers or writers. For this purpose two 
primitives are supported: get and discard. In short, the Clouds DSM system is implemented 
integrated with the operating system providing a low overhead when manipulating segment 
misses. Also, by enforcing consistency at defined synchronization points, this DSM system 
enforces the Release Consistency Model. The lock-based coherence protocol was an 


innovation when compared with existing DSM systems. 
2. Runtime Libraries 


a. Midway 

Midway is a Distributed Shared Memory System that supports the Entry 
Consistency Model. As described in Chapter II, Entry Consistency is a relaxed consistency 
protocol that requires the explicit association of shared data to synchronization objects. 


Upon a release operation the changes to the data associated with the lock are propagated to 
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the new acquirer by sending modifications (diffs) to the data object. To keep consistency, 
each processor stores a set of diffs to the corresponding object. To reduce communication 
traffic and guarantee that all changes are made visible to all processes, each process keeps 
a monotonically increasing counter (logical clock [LAM79]) which is incremented 
whenever a synchronization access is performed. At the acquire time the requester sends 
also its Vector Clock. The lock owner, in turn, forwards all changes that are greater than 
the received timestamp. It is the acquirer’s duty to coalesce all changes and update its diff 
set. If this set’s size becomes greater than the data associated with the lock, the data itself 
is propagated to the new lock owner. 

Lock ownership is defined using a Distributed Queue Algorithm similar to 
the Mach’s shared memory server [FBYR88]. This algorithm is based on the probable 
owner concept. The lock request is sent to the node that is currently the probable lock 
owner. If the node that has received the request does not own the lock anymore it forwards 
the request to the next probable owner, creating a chain of messages. This algorithm has a 
worst case complexity of O (n), where “n” is the number of processes that are accessing 
the same critical section. The drawback is that it has “n” possible points of failure per lock 
transaction. Figure 19 provides an example of this algorithm as originally proposed by 


Florin [FBYR88]. In Figure 19a processor P, performs a lock request. The request is sent 


to the “probable owner”, the node designated as root. Since the lock ownership has already 


changed to P, the root node forwards the request to the next probable owner, P,. Again the 
current owner has altered and the request is forwarded to P3 which then releases the lock 
(Figure 19b). The process is repeated again when P, performs a lock request (Figure 19c ). 


Synchronization objects can be of two types: non-exclusive (data can only 
be accessed on read operations) and exclusive (data can be accessed for both read and write 
operations). The synchronization objects ownership is exclusive. By exclusive ownership 


we mean that each synchronization object, at any time, has an unique owner. Replication 
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of data is only allowed for data whose synchronization access is non-exclusive (read-only 
data). 

Midway also provides two other consistency models: processor and release 
consistency. The idea is to allow the programmer to develop his application initially using 
a stronger consistency model and use Entry Consistency on further refinements based on 


the data access patterns. 


Lock Owner 


(PI) Req. Forward (72 ’ 
Lock Req. 
(a) 





Lock Owner 


Lock Granted 





Figure 19: Distributed queue locking scheme. 


There are two implementations for Midway: VM-DSM (Virtual Memory 
DSM) [BZS93] and RT-DSM (Runtime DSM) [ZSB94]. Although both versions adopt the 
Entry Consistency Model as the basic consistency mechanism, their difference lies on the 
embraced strategy for detecting and collecting writes to shared data in a software-based 
DSM. Both strategies rely on compiler assistance to insert primitives for marking shared 


data objects as dirty. In the VM-DSM approach the coherence is constrained to the 
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corresponding page size. On the other hand, on a RT-DSM the coherence unit is flexible, 
since the compiler inserts write-detection primitives on every store operation. It is 
interesting to consider the difference(s) between the two implementations. We summarize 


below some of the reasons for RT-DSM. 


Table 2: Distinction Between RT and VM implementations 






I. Writes have high overhead since they are detected with a page fault. 
This cost is amortized if there is a large number of writes per page. 










2. The page size is generally too big to serve as a unit of coherence, 
inducing false sharing. As we saw in Chapter II, there is a limit that we 
can reduce the unit of coherence without inducing larger overheads due 


1. It tends t 
Operating System altogether. 


2. RI-DSM directly supports variable sized objects eliminating false- 
sharing and the overhead necessary to accommodate it. 


















3. RI-DSM efficiently provides a detailed update history, which allows it 
to minimize the data transferred to maintain consistent memory. 





In summary, for coarse-grained applications that exhibit little actual 
sharing, a VM-DSM is advantageous. In contrast, for a program that synchronizes 


frequently, the RT-DSM system may have better performance. 


b. Munin 

Munin ({CBZ91], [CBZ92]) is a software DSM system that implements the 
Eager Release Consistency Model. 

The Munin runtime manipulates two major data structures as depicted in 


Figure 20: a delayed update queue and an object directory. The former is used to buffer the 








updates until a release operation is performed. The latter maintains the state of the shared 
data set being used by the local user threads. On the Object Directory all shared variables 
on the same physical page are treated as a number of independent page-sized objects. In 
contrast, variables that are larger than a page are handled as a number of independent data 
objects. 

Instead of the Centralized approach adopted on previous software 
implementations, Munin employs a Distributed Directory scheme similar to the one used 
in DASH. The scheme is enforced with the aid of two concepts: dynamic data ownership 
protocol (viz Midway) and by distributing the state information of write-shared data 


through the copyset nodes. 


Network 





Figure 20: Munin Runtime System [CBZ92]. 


The major goal of this system was to reduce the amount of communication 
needed to support DSM. Towards this purpose, Munin introduced three innovative features 


when compared to previous DSM implementations: 
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(1) Software Release Consistency. Munin implements Release 
Consistency inspired on the DASH project [LLGN92]. The major distinction between the 
two approaches is that Munin buffers the updates until the release is performed, while the 
DASH system pipelines them. Both systems send the updates to all nodes that are known 
to be currently holding a copy of the page. This implementation of Release Consistency 


became known as Eager Release Consistency. 


(2) Multiple consistency protocols. Based on observations on the 
data access patterns, five major types were distinguished: conventional, read-only, 
migratory, write-shared and synchronization. 


¢ Conventional shared variables are replicated on demand and are 
kept consistent using an invalidation-based protocol, that requires 
an owner to be the sole owner of that copy. For Read-only shared 
data once initialized no further updates can occur. 


¢ Migratory data is typically the data that is accessed within 
critical sections and is made consistent by sending an update 
message to the new owner and invalidating the local copy. 


¢ Write-shared variables are frequently written by multiple 
threads concurrently. Its main advantage is to allow considerable 
reduction of the False Sharing effects. 


¢ Synchronization Variables. There are three types of 
synchronization variables which are supported by the system: 
locks, barriers, and condition variables. These variables are 
accessed only through special synchronization primitives 
provided by library routines. The locking protocol is the same as 
the one adopted on Midway. 


Modifications to the shared-variables are buffered until 
synchronization requires their propagation. To reduce the message size each process sends 
only the modifications that were applied to the page. As can be seen from Figure 21 each 
page is initially write protected. When the local thread tries to write into it, a twin copy is 
generated and the original page is marked as writable (Figure 21a). When a release 
operation is performed the page is compared with its twin copy and the resulting diff is send 


to the other copyset nodes (Figure 21b). 
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Encode Copyset 


Make original 
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Figure 21: Write-Shared Protocol: twin creation when a page is accessed for writes 
(a) and sending out diffs (b) when a release operation occurs. 


(3) Update with timeout. To avoid sending unnecessary update 
messages to nodes that still remain in the copyset, but are not referencing the page, Munin 
implements a timeout mechanism in which copies which are not accessed during the last 


timeout interval are discarded. 


C. Treadmarks 


Treadmarks [ACDB94] is a software library DSM system that implements 
the Lazy Release Consistency Model as a consistency mechanism. LRC, described in 
Chapter II, performs the updates at the time of an acquire instead of the release as is the 
case for the Eager model (Munin). Another difference among the two models is that the 
updates are forwarded only to the process that is acquiring the synchronization variable 
instead of sending updates to all processes that currently cache that data. This considerably 
reduces the number of messages and the delay on synchronization operations. 

This set of optimizations do not come for free. After an update, Munin 


(Eager Release Consistency) can forget about all the changes the releasing processor made 
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prior to the release. This is not the case for Treadmarks (LRC). On a release operation the 
set of modifications (diffs) have to be cached as is the case of Midway [BCS91], so that a 
third processor is able to see all alterations that were performed to the data block. To do so 
Treadmark uses vector timestamps to represent the happened-before-1 partial order defined 
in [SH93]. 

When a processor executes an acquire, it sends its current vector timestamp 
in the acquire message. The process that has last performed a release (and is currently the 
lock owner) then piggybacks on its response a set of write notices. These write notices 
describe the shared data modifications that precede the acquire according to the partial 
order. As described on Chapter II, a write-notice is an indication that a page has been 
modified, but it does not contain the actual changes. The acquiring process then determines 
which of the incoming write notices contain vector timestamps larger than the timestamp 
of its copy of that page in memory. For these pages, the shared data modifications described 
in the write notices must be reflected in the acquirer’s copy. To accomplish this Treadmark 
currently invalidates its copies. It is worth mentioning that in [DKCZ94] it is proposed to 
employ a hybrid coherence protocol in which modifications performed at the releasing 
node are updated while for pages for which write-notices and no updates were received will 


be invalidated. 


C. COMPILER INSERTED PRIMITIVES - ORCA 


Orca is an object-based language! whose sequential statements are based roughly 
on Modula-2. Orca was originally designed for the Amoeba distributed operating system 
and it depends on the Operating System’s reliable broadcast feature to enforce consistency 
among objects that are replicated. 

Orca provides two important features for distributed programming: objects and the 


fork statement. Objects are like Abstract Data Types in Ada83. It encapsulates internal data 


1. By object-based language we mean a language with no support for inheritance and some forms 
of polymorphism. 
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structures and methods for manipulating them. Each method can be viewed as a pair of 
statements: guard + block. The fork statement is used to create new processes on a user- 
specified processor. Parameters, including objects, may be passed to the new process, 
resulting in object replication. 

The provision of object replication is the main feature that distinguishes Orca from 
other languages like Ada83. Objects can be in two states: single copy or replicated. A 
method that performs on a non-replicated object is performed by simply locking/unlocking 
the object. For replicated objects consistency becomes a very important issue. As 
mentioned above, the current language implementation enforces consistency through 
broadcasting the objects’ name, the methods, and the parameters. Each remote object then 
performs the operation, thus becoming consistent with the local object. An important 
requirement for this broadcast operation is that it must be reliable and the events should be 
totally-ordered. 

For systems that do not enforce a reliable and totally-ordered broadcast protocol, 
each object will have a primary copy which is responsible for updating all replicated 
objects. The update of replicated objects is performed in two phases. The first phase will 
consist of the object sending a message to the primary copy, locking and updating it as 
before. It is the primary copy’s role then to lock all remote copies. On the second phase the 
primary copy will update the replicated objects. 

For both the primary-copy algorithm or the reliable-broadcast the final outcome is 


that the runtime system enforces a sequentially consistent view of the system. 


D. HARDWARE/SOFTWARE COMBINATION - FLASH 


FLASH’s design exemplifies the current trend on multiprocessor systems 
architecture to integrate DSM and message passing. This is also valid for implementations 
of MIT’s Alewife and *T and the Meiko CS-2 [KOHH94]. 

FLASH is a single-address-space machine consisting of a large number of 


processing nodes connected through a pair (request/reply) of wormhole meshes. 
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Differently from the DASH implementation, each processing node consists of an unique 
processor, local memory and a “MAGIC” chip which is responsible for integrating the 
memory controller, I/O controller, network interface, and a programmable protocol 
processor. 

FLASH nodes communicate by sending intra- and inter-node commands, referred 
to as messages. The provision of a programmable protocol processor within the “MAGIC” 
chip allows the implementation of multiple protocols. These protocols are implemented on 
protocols handlers by defining the kind of messages that will be exchanged (the message 
types). Currently, the FLASH prototype enforces two types of protocols: the Cache- 
Coherence and Message Passing protocols. 

The Cache-Coherence protocol is directory-based and is similar to the one used on 
the DASH implementation with minor modifications: the coherence unit was enlarged 
from 16 to 128 bytes and, for scalability reasons, it uses dynamic pointer allocation instead 
of a bitmask as was the case of the DASH prototype. Another distinction between the two 
machines is that invalidation acknowledgments are collected at the “Home” node for the 
FLASH implementation. For this protocol all messages are divided into requests (read, 
read-exclusive and invalidate requests) and replies (read and read-exclusive data replies 
and invalidation acknowledgments). 

The Message Passing protocol defines a set of primitives for enforcing 
synchronization and block transfer. The latter set was designed to fulfill three requirements: 
provide user-level access to block transfer without sacrificing protection; achieve transfer 
bandwidth and latency comparable to a message-passing machine containing dedicated 
hardware support for this task; and operate in harmony with other key attributes of the 


machine including cache coherence, virtual memory, and multiprogramming [KOHH94]. 
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IV. LAZY DATA MERGING CONSISTENCY MODEL 


This chapter gives a comprehensive description of the Data Merging (DM) 
consistency protocol as proposed by Karp and Sarkar in [KS93] and the modifications 
introduced to it that resulted on Lazy Data Merging (LDM). Section A describes DM as 
originally proposed. Section B defines LDM and in Section C we introduce some examples 


that illustrate and compare both protocols. 


A. THE DATA MERGING DSM PROTOCOL - AN OVERVIEW 
In [KS93] a new DSM protocol is presented: Data Merging (DM). DM is based on 


the observation that any sharing that occurs between synchronization points during a 
parallel execution is false. For false sharing it is not necessary for caches to be consistent; 
only global memories need to be consistent. In particular, any concurrent updates to the 
same data block can be deterministically merged at global memory [KS93]. Therefore, 
DM, like other “relaxed” consistency protocols (e.g., Release Consistency, Entry 
Consistency, etc.) relies on explicit synchronization operations to enforce consistency. 
Data Merging also addresses the problem of false sharing by providing means for 
deterministically merging data blocks. 

This protocol introduces a new feature; the ability to combine message passing and 
DSM. This characteristic is indicated for data sets which present poor locality of reference 
or to enforce sequential consistency, since it provides exclusive access to the data elements. 
Therefore, a remote thread will be able to explicitly perform remote read/write operations 
on individual data elements through the use of “Bypass Cache” messages. 

The DM protocol involves two fundamental components: Global Memory Units 
(GMU) and Processing Elements (PE). The GMU can be better described as the “Home” 
node for a set of data blocks. The current proposal adopts a distributed/fixed manager 
approach for dividing the shared address space. Therefore, multiple GMUs are allowed, 


each one being held responsible for a part of the shared address space. 
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The Processing Elements represent the remote threads in which the actual 
computations are performed. Each PE has a “local memory controller” which is responsible 
for performing data requests/updates when necessary. In the original protocol updates are 
addressed to the corresponding GMU. This characteristic can also be observed in the 


DASH implementation. 


B. DATA MERGING PROTOCOL: DEFINITIONS 


Before proceeding on with this chapter, we need to define, for both the GMU and 


PE, each component element and its corresponding role. 


1. Processing Element (PE) 


PE is the processing unit that is replicated on multiple processors in a 
multiprocessor system and is composed of: 


¢ Local CPU; 


¢ Local Memory - The memory hierarchy level in the PE that interfaces with the 
global memory. If the PE itself has a multilevel memory hierarchy, then “local 
memory” refers to the lowest level (furthest from the CPU) contained within the 
PE; and 


¢ Local Memory Controller - The control logic for issuing global memory 
requests from the PE. 


2. Global Memory Unit (GMU) 


GMU is the memory unit that is replicated to obtain a shared global memory that is 


addressable by all PEs. 


¢ Global Memory Module - A piece of the shared global memory. Some storage 
in the global memory module is reserved for GMU state information; and 


¢ Global Memory Controller - The control logic for handling global memory 
requests from PEs. 


One implementation, suggested by Karp and Sarkar consists of multiple PE 


connected to a set of GMUs through an Interconnection Network. This organization is 
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depicted in Figure 22. An alternative design would allocate to each node both the 
processing element and the global memory unit. To provide a scalable design, such nodes 
should be organized using a Hierarchical Bus or Ring Network as is the case for the DASH, 


Alewife, and FLASH multiprocessor architectures. 


C. THE DATA MERGING PROTOCOL 


Similar to existing software implementations (i.e., Munin, CarlOS, and 
Treadmarks), this mechanism addresses the problem of false sharing by allowing multiple 
writers to the same data block. The distinction is that there is no required protocol-type 
annotation for shared variables as is the case of Munin in which data variables that allow 
multiple writers should be annotated as “write-shared’. 

For data that is accessed by multiple processing elements, it is assumed that a 
delayed memory consistency model and the synchronization mechanisms that it requires 
are implemented. This mechanism also provides direct accesses to the Global Memory 


through “Bypass Cache” messages. 





Figure 22: Data Merging Components. 
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The actions performed by each system component can be described as follows: 


1. Processing Element 


¢ Request cache block: request a copy of a data block from the GMU that owns 
it; 


¢ Flush Data: when replacing a dirty cache block, send its contents back to the 
GMU that owns the original data block; 


¢ Report Replacement: when replacing a clean cache block, report its 
replacement to the GMU that owns the original data block; 


¢ Bypass-read data element: the CPU reads a data element from global memory 
without storing a copy in the PE’s local memory; 


¢ Bypass-write data element: the CPU stores a data element in global memory 
without storing a copy in the PE’s local memory. 


Bypass read and Bypass Write messages may be used to enforce sequential 
consistency, rather than delayed consistency, on selected accesses to global memory. They 
are also more efficient for reads/writes accesses that have neither temporal nor spatial 
locality. The state transition diagram for each block within the PE node is illustrated in 


Figure 23. 
FSM - PROCESSING ELEMENT 


Flush to GMU 


wig, hire Bypass-Read 
Bypass- Write 





Figure 23: Processing Element Finite State Machine. 
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The cache block can be made invalid either by normal cache management policies 
or by an invalidate signal sent by the GMU. If the block is dirty it is flushed; if clean, its 


replacement is reported to the GM, but no data is moved. 


2. Global Memory Unit Actions 


The GMU keeps storage for the global data blocks and for monitoring the state of 
each individual block. For this purpose it has two major data structures: a dynamically- 
updatable “Suspend Queue” capable of holding up to one entry per process in the 
multiprocessor system and a “‘bitmask” that keeps track of the state of each individual 
element within a data block. Each GMU also maintain the following state variables to 
monitor the state of each individual block. The state transition diagram for each block 
within the GMU node is illustrated in Figure 24. 


¢ Counter (C): Identifies the number of PEs that currently have a copy of the data 
block in their local memories. 


¢ Suspend Bit (S): Indicates whether or not a process should be suspended when 
attempting to access the data block. 


¢ Bitmask: The number of bits corresponding to the number of elements in the 
data block. Its purpose is to identify dirty elements in a data block. 


FSM - Global Memory Unit 


Replaced 


Send Block 


Flushed 
Replaced 





Figure 24: Global Memory Unit Finite State Machine. 
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3. Actions Performed by the GMU in Response to the Processing Element 


Requests 


a. Request Cache Block 


¢ The suspend bit (S-bit) = 0: The GMC increments the counter and 
sends a copy to the PE; 


¢ The suspend bit (S-bit) > 0: The GMC inserts the data block request, 
requesting processor ID and current time stamp into the GMU’s suspend 
queue. 


b. Flush Data 


In this case, the PE replaces a cache block in its local memory and sends the 
contents of the old cache block (C) to be merged with data block D in the GMU. There are 
two cases of interest for the state information associated with data block D: 


¢ Counter = I and S-bit = 0: The GMC stores cache block C into data 
block D and resets the counter to zero. 


¢ Otherwise: In this case the GMC uses the bitmask to merge selected 
words from cache block C into data Block D, by comparison on a word- 
by-word basis of the two blocks. If the words are different, set the 
corresponding bit of the bit mask to one and set the suspend bit to 1. If the 
Counter becomes zero, then perform a bitmask-reinitialize operation. 


c. Report Replacement 

The PE sends a notification of the replacement of a data block without 
actually sending any data. For this case all is needed is to: 

¢ Decrement the Counter. 


¢ If the Counter becomes zero, then perform a bitmask-reinitialize 
operation. 


d. Bypass-Read Data Element 


The GMU sends the data element to the requesting processor. No state 


information needs to be checked or modified. 
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e. Bypass-Write Data Element 


When a PE writes a data element by bypassing local memory (cache) the 
GMU updates the corresponding global memory location. No state information needs to be 


checked or modified. 


f. Lock 


In this case the processing element has requested that the block be locked. 


¢ If the S-bit is set, insert the request into the GMU’s suspend queue along 
with the requesting process “id” and a special flag in place of the time 
stamp. 


¢ If the suspend bit is not set, set the bit, and return its old value. 


g. Unlock 


Set the suspend bit to zero and perform a bitmask reinitialize operation. 


h. Test and Set Lock 


Set the suspend bit to one and return the previous value of the suspend bit 


to the requesting process. 
1. GMU Internal Actions 


(1) Initialization: For each block of global memory a NULL 


bitmask pointer, a Counter, and Suspend Bit should be initialized. 


(2) Receive a dirty page: The counter should be decremented 
whenever a dirty page is received. The Bitmask pointer should be initialized with all 
elements set to zero. This represents the occurrence of multiple writers, but it solves the 


problem of false sharing. 


(3) Bitmask reinitialize: When the Counter becomes 0, deallocate 
the bitmask and Suspend Bit. Check if this event affects any processor in the suspend 


queue. 
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(4) Scan the Suspend Queue: To avoid deadlocks the GMU 
controller should periodically scan the suspend queue. If any data block request is older 
than a threshold age, the GMU should perform a timeout procedure by broadcasting an 
invalidate message, forcing the cached data block to be flushed. The existing copies of the 
data block are then merged, becoming consistent. The requesting process is then awakened 


by the release of the data block. 


D. LDM RATIONALE 


The suggested extensions to the DM protocol led us to define a “lazy” version 
which we name “Lazy Data Merging”. For this approach all data objects are considered as 
“write-shared”. Therefore, program correctness will rely on the adequate use of 
synchronization variables by the application programmer. 

Our approach introduces some enhancements to minimize the communication/ 
synchronization overhead generally encountered by software runtime libraries. The 
implementation details for this protocol are discussed in Chapter V. . 

As with DM we adopt a “distributed/fixed” policy for dividing the shared address 
space (static data partition as opposed to the adaptive partition scheme adopted by Munin). 
However, our protocol differs from DM by insisting that each node will have both PE and 
GMU threads sharing the same address space. Therefore, individual nodes are assigned as 
“home” for a specific set of data blocks. 

Our goals for the LDM protocol are both the reduction of the average message size 
and the number of messages. Message size reduction is achieved by forwarding “diffs” at 
the time of a release or a flush operation. We reduce the average number of messages 
through the use of a distributed locking scheme [FB88] and the Hybrid Coherence protocol 
[DKCZ93]. This combination will decrease the number of messages that will be dispatched 
to a single node at the time of a lock release operation when compared to an invalidate 
protocol. Also, the main advantage of the distributed locking scheme is to reduce the 


contention when compared to the centralized approach. 
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The performance gains that can achieved through the use of “diffs” for pages that 
are dirty are not very well defined. The result will be highly dependent on the type of 
network, number of participant nodes, granularity of the data coherence unit, etc. Our 
proposition is to use “diffs” for all data blocks that are dirty. Our belief is that when the 
ratio “processing power/ network bandwidth” is high it is worthwhile in creating diffs. 

The next sections will give an overview of this memory consistency protocol and 
in Chapter V we describe the mechanisms for implementing the modules specified on 


Figure 25. 
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Figure 25: Lazy Data Merging Runtime Environment. 
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EK. EXTENSIONS TO THE DATA MERGING PROTOCOL 


To optimize performance for a software implementation we introduce the following 
extensions: 


¢ In Lazy Data Merging the shared address space is structured as a set of shared 
data-objects, while in DM the address space is flat. 


¢ In order to reduce the number of invalidate messages each Home node 
maintains a Directory with the nodes that currently cache the given page, instead 
of broadcasting invalidate messages to all PEs. 


« We suggest using a distributed locking scheme [FB88] by employing a hybrid 
(invalidate/update) coherence protocol [DKCZ93]. The semantics for the 
acquirer are similar to the Lazy Release Consistency model. 


¢ The GMU is co-located with the PE on the same node. Both threads share the 
same address space. 


« We assume two types of locks: Read and Write. This approach is similar to the 
one introduced by Midway. Lock ownership will only be modified if a lock is 
acquired for writing. 


¢ Updates will be encoded by using diffs to the original pages. The purpose is to 
reduce the network load, by reducing the size of messages. This approach 
imposes some overhead to compute diffs of each page and also requires extra 
storage for data blocks that are dirty, so that we can capture all changes 
introduced to the data block, but is in large compensated for relatively slow 
networks. The use of “diffs” is also an imposition of the protocol for allowing the 
correct propagation of modifications on shared data to the last acquirer. 


F. THE LAZY DATA MERGING PROTOCOL 


As we have already described, the shared address space will be evenly divided 
across the set of nodes that are part of the system. Therefore, each node will be assigned as 
“Home” node for certain block segments. Each “Home” node will, in turn, be responsible 
for performing the data merging operations whenever a block it manages is flushed from 
one of the remote caches. 

This new approach addresses the problem of false sharing as aggressively as “Data 
Merging” does: we move some burden to the programmer, by requiring that all shared data 
blocks that are concurrently accessible by multiple nodes should be protected by the 


appropriate synchronization mechanism. 
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Besides managing “Home” data blocks, the DSM thread handles data block misses 
(detected by catching “SIGSEGV” signals) and performs the data requests to the 
appropriate block owner. Other roles are to create diffs for dirty pages (by the time of a lock 
release, a global barrier call or when an invalidate message is received), mark the page as 


dirty (page protection is set to PROT WRITE) and update the set of write-notices. 


1. Protocol Notation 


Before we can describe the protocol itself we need to define the notation that is 
adopted, which we believe is appropriate for explaining the LDM protocol by providing 
means for representing the messages interchanged and its arguments and the actions 


undertaken at both ends (sender/receiver). 





Node 1 Node 2 Node 2 


Message }( arg], ...) 





Process Boundary Interprocess Communication Composed actions: a) block of 
actions and b) conditional acttons 


(a) (b) (c) 





Figure 26: Notation for description of LDM. 


Each process context is inserted within one frame limited by vertical lines. This 
representation is depicted in Figure 26a. There are five basic types of processes: 


* requesting node: the node which issues either a request for a page, an acquire 
lock, or a barrier call; 


¢ home node: the node that is responsible for the block management; 
* copyset nodes: the group of nodes that currently cache a copy of the same page; 


¢ lock owner: the current probable owner (see Chapter III for the details). 
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¢ write-notice node: the node that has last modified a write for that particular data 
block. 


In general terms a message is described by an arrow that crosses the process(es) 
boundary (Figure 26b). The actions may be atomic or composed actions. Composite 
actions are involved by a line that extends until the end of the block (Figure 26c). 
Conditional actions are actions that depend on the internal state of the processing node 
(Figure 26c). 

All of the above events are described within the domain of time, with the initial 


event being represented at time “0” and the last one at instant “?’. 


2. Description of LDM actions 


This section describes the set of actions that are taken by each node in presence of 


the following events: page faults, bypass-cache requests, and synchronization events. 
a. Page Fault 


The page fault handler! should, in turn, convert the faulty address into the 
page number and “hash” it into the appropriate Home node. Our design provides means for 
establishing the policy for static division of data among the nodes (i.e., block partitioning, 
cyclic stripe partitioning, etc.). The default policy will consist of dividing the shared 
address space into blocks of four contiguous pages. Once the home node has been 
determined a request block message is issued. 

Page requests are handled in the same way as defined on the Data Merging 
protocol. Upon receipt of a Request_Block type message, the Home node will verify if the 
S-bit 1s set. If so, the requesting node should be inserted on the Suspend Queue. This 
request will remain on the Suspend Queue until S-bit = 0 or a timeout occurs. 

If the S-Bit is not set then the Home node includes the requester ID on the 


block copyset and forwards the requested block. These actions are described in Figure 27. 


i. A SIGSEGY signal handler. 
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Discrete Time Line. 


Tiny Requesting Node Home Node Copyset Nodes , Write-Notice Node 


« Block Miss. 
If Block is associated to 
a lock. 
. If there is a node 
that has issued a write- 
notice. 
. Send the request to 


Mus: mode, Request Data Block [Block N); 
else send the request to the 


Send_Data_Block_Djiff (Block N); “Send the block 
diff 
appropriate home node 


Request Data Hlock (Block N) 


If S-bit = 0 -> insert the 
request Id on the block 
copyset. Send the block 
to the requesting node. 


if S-bit = 1 -> insert the 
request Id on the 
Suspend Queue. Reply 
with a Block Process 
message. 


If a Timeout occurs 
Invalidate Messag¢ (Block_ Number) 


lf Block Dirty 
. Prepare the diff block 
. else 
. Reply with a CLEAN 
message. 


Update/ Cleag Messages 
- Merge all updates 


Block Grangd/Block_Process (Block_Num) 


. Copy data block to shared 
address space. 





Note: If the block is missing by the first time, the request should be forwarded to the appropriate home node, before a diff request 
can be issued to the write-notice node. This FLAG should be reset if the data block is also associated with a barrier object. 
The merge of Updates will be better defined during the Barrier Call definition. 


Figure 27: Performing a Data Block request. 


b. Bypass-Cache Messages 





- Select the diff for 


(1) Bypass Read Messages: as mentioned before, our protocol 


provides means for performing remote reads. Upon receipt of such a message, the home 


node will request update messages from all blocks that currently cache the requested block 


and merge all these nodes. After this it will forward the requested block. By updating rather 
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than invalidating remote copies, we allow multiple processes to access the same data block 
for read operations. These messages should be used with data that presents poor locality of 


reference or to achieve a sequentially consistent program. These actions are shown in 


Figure 28. 


Discrete Time Line. 


Time Requesting Node Home Node Copyset Nodes Lock Owner 


. Bypass Read: 
Bypass Read (Address, Length) 


If (Copyset == Null) 
-Forward the value. 


else 

- Request updates 
from all copyset 
elements. 


Send_Updaf (Block_Id) 
If data block is clean 


reply with a CLEAN 
message. 


else if dirty 

- Prepare the diff 
for the given block 
and forward it.* 


Update / CLEAN Messages 


- Merge all modifications 
performed on the block. 
. Forward the requested 
Data Value to the reader. 


Reag Reply (Value(s) ). 
- Read the received value. 


Note: With small modifications it is possible to return the requested object only, instead of the diff for the whole page. 


Figure 28: Bypass-Read Messages. 


(2) Bypass-Write Messages: Once a bypass-write message is 
received the home node should “invalidate” all remote copies of the given block. Then the 
Home node should merge all received updates and perform the write operation. These 


actions are described in Figure 29. 





Discrete Time Line. 


Time Requesting Node Home Node Copyset Nodes Lock Owner 


Bypass Write |Address, Length) 


If (Copyset == Null) 
. Write the value. 


else 
. send Invalidates to 
all copyset elements 


Invalidate (Bjock_Id) 


If data block is clean 
- Reply with a CLEAN 
message. 

- Invalidate the local 
copy.* 

else if block is dirty 

. Prepare the diff 
for the given block 
and forward it. 

- Invalidate the local 
copy.* 


CLEAN / Update Messages 


-Merge all modifications 
performed on the bloc 

. Perform the write 
operation. 


Reply (Written). 


NOTE: The reason for the use of Invalidate messages is to force processors to retrieve the most recently data value. Therefore we 
can enforce that the “appropriate”’ use of “Bypass Cache” messages is sequentially consistent. As can be observed for the 


“Bypass Read” message we do not require that the local copies to be invalidated, since we adopt the MRSW coherence 
protocol. 


Figure 29: Bypass-write message. 


c. Synchronization Operations 


We rely on synchronization operations to enforce consistency among 
multiple threads. Therefore, any data access that may result on data race conditions requires 
the use of explicit synchronization operations. For this purpose we provide two basic 
synchronization mechanisms: Locks and Barriers. The actions for each mechanism are 
described in the following figures. We use diffs to minimize the effect of network latency 
(by reducing the message size) and for correctness [K95]. Diffs are obtained by creating a 


copy of the original block (a twin copy) before a node tries to write into it. At the time of a 


65 


release/barrier call operation the differences between the twin copy and the original block 


are inserted on the diff. This action is summarized in Figure 30. 


Protection Fault Invalidate/ 


———_—_ - Encode 
Create Twin a a ees 2S Release_Lock changes 


Barrier call 


Mark 
page as 
writable 





Figure 30: Diffs creation process. 


d. Partial Ordering Definitions 


In order to define a partial order between multiple intervals we need to 
enforce the requirements established in [AH90] for the relation happens-before. The 
requirements for this relation can be described as follows: 


¢ If a; and ay are accesses on the same processor, and a; occurs before ay in 
program order, then a, happens-before ap. 


If a; is a release on processor pj, and ay is an acquire on the same memory 
location on processor py, and ay returns the value written by a;, then a, happens- 
before ap. 


¢ If a, happens-before ay and ay happens-before a3, then a, happens-before az. 
[KCZ92]. 


Each node maps to an index in the Vector Timestamp, therefore, the logical 
clock of node; maps into index “i“ of the corresponding Vector Timestamp. The happens- 
before relation can be enforced through the following criteria: 


¢ At an acquire_lock the requesting node sends the vector timestamp of 
the last release. The lock owner, in turn, will send all write-notices that 
were performed after the received Vector Timestamp or that are 
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concurrent to the received vector timestamp. The lock owner should also 
release the “diffs” of the shared objects that it might have modified. 


¢ Once the lock is acquired the new lock owner will update the blocks for 
which it received the diffs and invalidate the ones for which write-notices 
were received, but no “diffs’’. 


¢ The lock owner will update its Vector Timestamp, by incrementing its 
logical clock and replacing it on the received Timestamp. 


The rules for updating the Vector Timestamps are as follows: 
* Rule 1: Clock C; is incremented between any two successive events in 


process P;, such that C; [i] = C; [i] + 1; 


¢ Rule 2: If event “a” is the sending of a message “‘m’” by process P., then 
“m” has a Vector Timestamp VC = C; (a) (using rule 1). When process P; 
receives the message it updates its Vector Clock to: 


Vk, or = max(C {k} VT [k]) 


¢ where C; (a) corresponds to the Vector Timestamp to any event a at 


process 1. For our approach we consider as conspicuous events only the 
operations that result on the issue of “diffs” (release locks, barrier calls, 
and invalidates). 


The example of Figure 31 clarifies this issue. In this example three 
processes, P;, P2 and P3 are requesting the same write-lock. At each acquire that is granted 
the lock owner will update its own VC and forward the lock with the corresponding write- 
notices and updates that are larger than the received VC. The acquirer should, in turn, 
update its own logical clock. 

Based on the above rules we can state that “a happen-before-I b” if and only if 
VC(a) < VC(b), otherwise they are said to be concurrent. The definitions below describe 
when VC(a) is less than VC (b): 
¢ Not equal: 
VC, #VC,& ai VC Li] #VC, [i] 
¢ Less than or equal: 


VC_<VC, © vif VC lil <VC, Li ) 
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¢ Less than: 
VC_<VC, ©@| VC_<VC, |al| VC_#VC 
a b a b a b 


¢ Concurrent events: 
vc_llvC, = VC_< VC; A + VC, < VC) 


[0,1,0] [0,1,0] {1,1,0] 


el} 
Acq_Lock 


{0,1,0] 


[0,0,0] [1,1,0] 


Global Time 


event e21: will forward the wnite-notices performed by process P2 and the corresponding diffs. 

event e11: will forward to P3 two write-notices [0,1,0] and {1,1,0] and the diffs that corres - 
pond to event e11. 

event e31: will forward to P2 two write-notices [1,1,0] and [1,1,1] and the diffs that corres- 
pond to event e11. 


event e22: will forward to P1 two write-notices [1,1,1] and [1,2,1] and the diffs that corres- 
pond to event e22. 


Note: At the release time the lock owner should send the diffs and write-notices that are larger 
than the corresponding received Vector Timestamp. 





Figure 31: Vector Clock implementation. 


e. Read and Write Locks 


As mentioned before, we introduce two types of locks: Read and Write 
locks. “Lock ownership” will only change when a lock is acquired for “writing” . When a 
lock is acquired for reading, the current lock owner should introduce the requester “id” on 
the lock copyset and forward the write-notices and updates (Figure 32). The lock acquirer 
should, in turn, perform the updates to the local pages and invalidate all pages for which 


write-notices and no diffs were received (Figure 32). At the time of a lock release the reader 
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should report to the lock owner (Figure 33). The lock owner should, in turn, upon receipt 
of the release message remove the node from the list of readers. A lock should be 
considered “WRITE-AVAILABLE” if and only if the list of readers is empty. A lock should 
be considered “READ-AVAILABLE” if and only if the lock owner is not currently holding 
it. By adopting this policy we allow multiple readers to concurrently access a critical 
section, but a unique writer can be within a critical section at a time. 

When a write-lock is acquired (Figure 32), the lock owner will forward to 
the new acquirer the set of write-notices, the diffs and the queue of processes waiting for 
the lock (if any). Once the lock acquired message arrives, the new owner will execute the 
Same actions described for a read-lock. When a release on a write-lock is performed the 
new acquirer should check the lock queue. If there is any process waiting for the lock it 
should create the diffs for all pages marked as dirty and forward them to the requester. As 
mentioned before, if the lock is being acquired for write, the ownership is altered, otherwise 
the process maintains the lock ownership (read locks) (Figure 33). 

To enforce consistency, the node that is acquiring a lock must be aware of 
all modifications introduced on the data protected by the synchronization variable. We 
solve this problem by adopting the same approach as on Treadmarks [ACDB94] by 
forwarding the write-notices performed on the blocks. A write-notice is an indication that 
a page has been modified during a particular interval, without specifying the actual 
modifications. But, an acquiring node does not need to be aware of all write notices. An 
acquire operation should only receive the set of write notices that were performed by other 
nodes after its last release operation. This requirement introduces the notion of enforcing a 
logical time so that we can achieve a partial ordering of events. This issue is addressed 


through the use of vector timestamps as proposed on [KCZ92]. 
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Discrete Time Line. 


Time Requesting Node Home Node Other Nodes Lock Owner 


- Acquire Write- Lock: 


. Locate the “probable” 
lock owner. 


. Send an Acquire_Lock 
message. 


Acquire Lock (Lock Id, WRATE, Vector Timestamp) 


If Lock_Owner 


If Lockid == FREE 
. Send the set of 
“write-notices” 
which are greater 
than the received 
Vector Timestamp 
and the updates 
for the blocks which 
were locally written. 
. Update the Lock 
owner to the new 
owner 


If Lock_Id is not Free 

. Insert the requester 
and the Vector Time 
stamp on the queue 
for that lock. 









Lock_Granted (Write_Notice¢*, Updates, WRITE, Lock Queue]. 


If not Lock_Owner 
- Forward the request 
to the next probable 
owner. 


. Update Lock ownership 
to the current Node ID 


- Update ail blocks for 
which updates were 
issued. 


Forward Req (Node_Id,Lock_ Id, 


Timestamp). 
If Lock_Owner 


. Invalidate all blocks . Process Request. 


for which write notices 
and no updates were 


F es Else 
received (this will force : 
requests to be done on . Forward Request to 
demand. So that only the current owner. 
blocks that are needed 


are requested). 


Requ¢st_Update (Block_Id). 


Note: 1- The Write-Notices consist of the triple (Block_Id, Timestamp, Last_Writer_Id). If two processes modify the same block 
their changes should be merged into a single update block and the write_notice triple should be updated to both the 
ID and Timestamp of the last modifier. If the size of the update block gets larger than the block itself we should replace 
the Update Block by the block with a special annotation. 


2- For a pure software solution, invalidations can be handled by modifying the protection of the given data block. 
During the invalidation process the block owner should be replaced by the corresponding last writer. 


Figure 32: Synchronization event: acquiring a Write-Lock. 
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Discrete Time Line. 


Time Requesting Node Home Node Other Nodes Lock Owner 


- Acquire Read-Lock: 


. Locate the “probable’’ 
lock owner. 


- Send an Acquire Lock 
message. 


Acquire Lock (Lock_Id, REAIP, Vector Timestamp) 


If Lock_Owner 
Lockid == FREE 
. Send the set of 
“write-notices”’ 
which are greater 
than the received 
Vector Timestamp 
and the updates 
for the blocks which 


were locally written. 


If Lock_Id is not Free 

- Insert the requester 
and the Vector Time 
stamp on the queue 


for that lock. 
Lock_Granted (Write_Notices*] Updates, READ). 


U the tocdkeOwne else If not Lock_Owner 
- Update . Forward the request 
to the current owner. to the next pr pe 2 
owner. 
- Update all blocks for 
which updates were Forward _Rbq| (Node_Id,Lock_Id, 
oe Timestamp). 
- Invalidate all blocks 
for which write notices 
and no updates were 
received ( this will force 
requests to be done on 
demand. So that only the 
blocks that are needed 
are requested). 
Requqst_Update (Block_Id). 


Note: 1- The Write-Notices consist of the triple (Block_Id, Timestamp, Last_Writer_Id). If two processes modify the same block 
their changes should be merged into a single update block and the write_notice triple should be modified to both the 
ID and Timestamp of the last modifier. If the size of the update block gets larger than the block itself we should replace 
the Update Block by the block with a special annotation. 





2- For a pure software solution, invalidations can be handled by modifying the protection of the given data block. 
During the invalidation process the block owner should be replaced by the corresponding last writer. 


Figure 33: Synchronization event: acquiring a Read-Lock. 





Time 
0 





Discrete Time Line. 


Requesting Node, Home Node Write Nodes Lock Owner 


Release Lock 


For each block marked 
as dirty. 


. Create diff 

Blocks for all 
blocks marked as 
dirty. 

- Insert the new 
write_notice in the 
list for the Lock Id. 


If Lock_Id. Queue /= empty 


While next 
lock request == READ && 
LockID == READ FREE da 


. Send a Jock granted 
message. 

- Remove the requesting 
node from the lock queue 

- Insert the Node_Id into 
the Readers-List. 


Lock_Granted (Write_Notifes*, Updates, READ). 


If Lock _Type == READ 


- Update the pages for which 
write-notices were received 
and update the Vector 
Timestamp. 


. Perform all Computations 


- Release the lock -> send a 
message to the requesting 
node, 






Lock Release (Node_ 


- Remove Nodeld from fist of 
readers for that lock. 
(readers List==empty) set 
-Set LockID = WRITE-FREE 
f next lock_request==WRITE 
&& LockID== WRITE-FREE 
. Send a lock granted 
message to the next 
acquirer Node on the 
queue. 





. Set the lock owner 


Lock [Granted (Write_Notices*, Updates, WRITE, Lock Queue). to the new acquirer. 


else 

Perform the same actions 
specified for the acquire_ 
iock. 


Note: The update for the write-notice will consist of updating the Timestamp with the current one and by replacing the 
last writer by the current Node_Id. Therefore, the last process that writes into a data block will hold the write-notice. 


Figure 34: Synchronization event: performing a lock release. 
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f. Barrier Call 


Barrier synchronization primitives fulfill two distinct purposes: 
synchronization and consistency. Its use allows us to evict the dirty blocks to their 
corresponding Home nodes, forcing the global memory to enter in a consistent state. For 
this purpose we allow the user to specify the data set that should be flushed to its 
corresponding home node. If the barrier’s purpose is to perform global data coherence, all 
dirty blocks should be flushed to their corresponding home nodes. We name this type of 
barrier primitive as a “Converge Barrier” and it should be explicitly used whenever 
“global memory” must enter in a consistent state. If the barrier is a “local barrier’ only the 
data specified at its creation should be flushed. 

In our protocol the barrier primitives are designed adopting a centralized 
approach. A barrier call should be executed in two steps. On the first step each node will 
send their updates to the corresponding home nodes and invalidate its local copies. Upon 
completion of this initial step the process should perform a call to the designated barrier 
manager. The barrier manager, in turn, monitors the number of processes that have 
executed a barrier call. When this number is equivalent to the number of registered 
processes the manager multicasts a“CROSS_ BARRIER” message. Upon reception of such 
message each process should wake-up. 

When the home node receives the updates it is possible that two 
modifications to the same location are received (only if the data is associated to a lock). For 
this purpose the relations happens-before should hold, otherwise, concurrent accesses are 
performed to the same memory location, resulting on a nondeterministic result. The rules 
for merging diffs can be summarized as follows: 


¢ If the vector timestamp of the received diff is larger than the original one 
the diffs are applied to the block even if the corresponding bits on the 
bitmask are already set. 


¢ If the vector timestamp of the received diff is smaller than the initial one 
then the received diff is discarded. 


¢ If the vector timestamp is concurrent with the Vector Timestamps 
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already used, then the diff is applied to the data block if and only if no two 
diffs modify the same location, otherwise an error condition shall be 
raised. This feature will not prevent concurrent writes to the same 
memory location, but will detect them during runtime. 


Figure 35 illustrates a barrier call execution. 


Discrete Time Line. 


Timey Requesting Node Home Nodes Barrier Manager 
0 
Barrier_Call (Barrier ID); 


. Compute the updates 
of all data blocks that are 
marked as dirty and 
forward them to their 
corresponding Home 


Nodes. 
- Envalidate all data 
blocks associated with 
the Barrier. 
Update_Messages 
. Set S-bit = 1. 
- Merge all incoming updates. 
- Decrement the Counter. 
- When all nodes that cache a copy 
of that block have flushed the 
given block Set S-bit = 0.* 
. After all modified . Remove the process from the list 
data has been flushed of nodes that cache the given 
to their corresponding blocks. 


home nodes send the 
barrier message to the 
root node. 


Barrier_Call (Barrier_Id) 


For each Barrier cal] received 

increment the counter, until the 

counter = number of nodes. 

. When counter = number of 
nodes send a barrier crossed 


crossed message is 


. Wait until a barrier 
received. 





call. 
Cross_Barrier (Barrier_Id). 
- Proceed until a next - Reset the local counter to 1. 
barrier call is 
performed. 


Note: This condition is detected when Counter == 0. 


Figure 35: Synchronization event: barrier call. 
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G. PUTTING IT ALL TOGETHER 


We use a couple of examples to clarify the differences between Data Merging and 
Lazy Data Merging. The first set of examples describe a more synchronization intensive 
problem, in which we describe the actions that are adopted by both DM and LDM protocols 
and provide a qualitative analysis of the communication costs involved. 

For uniformity, the data distribution described in Figure 36 is adopted by all 


examples. 


1. Distributed Data Base Problem 


Assume that each node needs to lock the data base record before accessing its fields. 


The problem can be summarized as follows: 


While not done loop 
acquire lock (lock_id) 
perform_modifications. 
release_lock (lock_id) 

end loop. 


For this type of problem our approach should perform better than Data Merging 
since each release operation will require that all data associated with the lock to be flushed 
to the corresponding GMUs. 

In our approach the data will be released only at the time of an acquire and only to 
the node which is requesting the lock. The releasing node should, in turn, send all 
modifications (write-notices) that have happened after the vector timestamp received from 
the requester. As before, we also optimize by sending only updates (diffs), instead of the 
entire data block. The use of a hybrid coherence protocol will reduce the amount of 
communication since the new acquirer will be able to update all blocks for which the last 
releaser has introduced modifications. 

To better describe the actions that are taken by both protocols under the presence of 


synchronization operations we use the code example below. 


ie 





Processor 1: Processor 2: 


Acq_Lock (LockID) Acq_Lock (LockID) 
a A = 10; 
Release_Lock (LockiD); Release_Lock (Lock!D); 


The next two subsections describe the interactions between the multiple DSM 
system components for the example described above, under both the DM and LDM 


protocols. 


a. Data Merging 


An acquire lock for the DM protocol would require a larger number of 
messages for performing a lock operation. Locking each individual data block requires that 
all remote copies should be invalidated before the lock could be granted. For the original 


approach an invalidate message should be multicasted to the entire data block copyset. 


' Lock 1! 


Block 1 || | Block 5 


Regions managed by GMU/Node 1. 
Regions managed by GMU/Node 2 





Figure 36: Data distribution across the blocks. 


Each node should reply by flushing the block, if dirty, or by sending a clean 
message in otherwise. After the block is merged at the GMU, the GMU forwards the block 
to the lock requester and sets the S-bit to one to ensure that new requests are put at the 


Suspend Queue until the lock is released. 
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At the time of a lock release, the lock owner should flush the page to the 
corresponding GMU. The GMU should decrement the counter and reset the S-bit to zero. 
The changes are inserted into the corresponding data block and the bitmask is then cleared. 


This sequence of actions are described in Figure 37. 


Discrete Time Line. 
Time Node 1 GMU 1/ Z Node 2 
Acquire Lock (1). 


- Request block 2 for 
locking mode. Request Block [2, Lock). 


If S-bit block 2 == 0 then Acquire Lock (1). 
: if counter == 0 set S-bit = 1 a 
else invalidate remote 
goples merge then and 
Set S-bit = 


else insert node id on Suspend_Queue. 
Blockj2 


Request Block (2, Lock). 


If S-bit block 2 == 0 then 
if counter == 0 set S-bit = 1 
else invalidate remote 


copies, merge them and 
Set S-bit =f 


else 
. insert node id on Suspend_Queue. 


Node 2 inserted on the suspend 
Release Lock (1) Queue on GMU 1. 


Flush block 2 


- Merge the data block 2 
set S-bit = 0, Counter = 0 
- Verify requests on the suspend 
ueue. 
request is Lock Request, 
act as normal lock requests. 


Block 2 


A= 6; ; 
ee Release Lock (1). 


- Merge the data block 2 
set S-bit = 6, Counter = 0 
. Verify requests on the suspend 
ueuve. 
- If request is Lock Request, 
act as normal lock requests. 


Figure 37: Data Merging. 


b. Lazy Data Merging 


Now we describe the actions performed by LDM for the same code example 


of Figure 36. In LDM each lock operation requires a smaller number of messages by 
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avoiding the issue of multiple invalidate messages, receiving their changes and forwarding 
the updated block to the acquirer. LDM is also expected to reduce the average message size 
by forwarding diffs to the next acquirer instead of the entire page. 

A more generic example would suffice to give a quantitative view of the 
number of messages that each approach would require. Assuming that “n” nodes cache an 
arbitrary page and Process | performs a lock request. On the DM, the number of messages 


for each block that may need to be issued would be: 


(n invalidate messages + n update messages + lock granted) 


For “k” blocks this amount should be multiplied by k. 
For the LDM approach the number of messages would be considerably 


reduced (for the worst case) to: 


(n acquire lock messages + lock granted message) 


We describe the actions that are undertaken by both lock owner and requester in 


Figure 38. 


2. Lazy Data Merging: Read and Write Locks 


One of the extensions that LDM protocol introduces is the use of read and write 
locks. Lock ownership will only be modified when a lock is acquired for writing. Our 
locking semantics allow, at any time, multiple readers, but a single writer to access the 
critical section they protect. The example of Figure 39 illustrates the actions that should be 


taken for both read and write locks for the program listed below. 


Processor 1: Processor 2: 

Acq_Write_Lock (1) Acq_Read_Lock (1) 
A=A+1; D=A+1; 

Release_Lock (1); Release_Lock (1); 
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Discrete Time Line. 


tinte Node 1 Root Node 2 


Acquire Lock Write (1) 
Acquire Lock (1, WRITE). 


Lock_Granted [1, No-Write-Notices, No-Wait). 


A=: Request_Block (Block 2). 


S-bit block 2 == 0 then 
- Insert Node 1 on the block 
copyset. 
. Send the block to the requesting 


Block (BLOCK2, BLOCK2_DIFF); node. 


Acquire Lock Write (1). 
. Update the received data block ~ = 


- Mark the block as dirty. e_Lock (1, WRITE, VC2). 
. Create the twin copy. = 
. Change the protection to read- Lock owner = Node 1. Forward 

write. the request to nodel. 


. Perform the write A = 5. 






Forward Léck_Req (1, WRITE, VC 2, Node2) 


- Insert the lock request on the 
lock queue 
(Nade 2, VC2, WRITE). 


Release_Lock (1) 


- Update the loca! Vector 
Timestamp. 

. Forward to P2 the set of 
write-notices which VC1 are 
larger than VC2. 

. Create the diff for block 2. 

. Send the diff for block 2. 

- Move the lock owner to 
Node 2. 


Lock_Granted | (Write Notices, Diffs, No-Lock-Request} Pending). 


- Update the lock ownership to 
Node 2. 
- Update the Vector Timestamp. 
- Update the blocks for which diffs 
were received. 
- Invalidate blocks for which write- 


notices, but no diffs were received. 


A=6; 


- Mark the block as dirty. 

. Create the twin copy. 

. Change the protection to read- 
write. 

- Perform the write A = 6. 


Release_Lock (1) 


. Update the local Vector 
Timestamp (VC2). 


Figure 38: Lazy Data Merging. 
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Another point that should be made when comparing LDM and DM is the message 
size and the number of page requests that should be taken whenever a lock operation is 
performed. Assuming that the number of writes is equivalent to the number of reads to a 
shared page, we can assume that the average message size can be nearly 50% smaller than 
the message size on DM. This number can be significantly smaller depending on the ratio 
of reads/writes. If this ratio is relatively large (i.e., one write for the entire page) the gain 
becomes significant. On the other hand if the ratio is small (i.e., every word is written into) 
then the entire page is forwarded, resulting in the same performance of DM. 

The number of page requests will also be an issue. An acquire lock operation DM 
invalidates all cached pages, forcing new page requests for every page even if a small 
portion of the page is being concurrently accessed. In contrast, LDM does not invalidate 
any remote pages. It is assumed that before writing into a shared value, the programmer 
stipulates an appropriate synchronization operation. Therefore, the number of page 
requests 1s significantly reduced. The Hybrid coherence protocol should only locally 


invalidate pages for which it received a write-notice but no update messages. 


3. Data Merging and Lazy Data Merging Barrier Call 


As mentioned before, barriers have the property of enforcing consistency of the 
entire global memory or for designated portions of it depending on whether the barrier is 
of a converge type or not. The approach adopted for barriers requires that pages associated 
with a barrier object should be locally invalidated and, in most cases, it should have a 
slightly worse performance than DM. LDM delays sending all diffs to the home nodes until 
the barrier call (Figure 40b). In contrast, DM flushes the dirty pages to the corresponding 
GMU whenever a page should be replaced. DM distributes the communication during the 
entire computation, but it requires system support for enforcing that pages are redirected to 


the corresponding GMU (Figure 40a). The figure below compares the two approaches. 
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Discrete Time Line. 
Node 1 Root Node 2 
Acquire Lock Write (1) 
Acquire Lock (1, WRITE). 


Lock_Granted [1, No- Write-Notices, No- Wait). 


A=A+1; Request_Block (Block 2). 


S-bit block 2 == 0 then 
- Insert Node 1 on the block 2 
copyset. 
. Send the block to the requesting 
Block (BLOCK2, BLOCK2 DIFF); node. 


Acquire Lock Write (1). 





. Update the received data block 


- Mark the block as dirty. Acquife Lock (1, READ, VC 2). 
. Create the twin copy. = 

. Change the protection to read- Lock owner = Node 1. Forward 

write. 


the request to nodel. 
- Perform the write A = A + 1. 


Forward _Léck_ Reg (1, READ, VC 2, Node2) 


- Insert the lock request on the 
lock queue 
(Node 2, VC2, READ). 


Release_Lock (1) 


. Update the local Vector 
Timestamp. 

. Forward to P2 the set of 
write-notices which VC1 are 
larger than VC2. 


- Create the diff for block 2. 
. Send the diff for block 2. 


- Insert Node 2 on the lock set. 
Lock_Granted (Write_Notices, Diffs, Null). 


. Update the lock ownership to 
Node 1. 

. Update the Vector Timestamp. 

. Update the blocks for which diffs 
were received. 

- Invalidate blocks for which write- 


Request Block (Block 3). notices, but no diffs were received. 
- Insert Node 1 on the block 3 
copyset. D= A+ 1; 
. Send the block to the requesting 
node. Block (BLOCK3, BLOCK3_ DIFF); 
- Mark the block as dirty. 
- Create the twin copy. 
- Change the protection to read- 
write. 


- Perform the write D = A + 1;. 


Release Lock (1) 


. Update the local Vector 
Timestamp. 


. Send a release message to the lock 
owner. 


Release_Read_ Lock (1). 


- Remove Node 2 from the List of 
readers. 


Figure 39: Lazy Data Merging: read and write locks. 
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a 
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Figure 40: Comparison of DM and LDM during a barrier call. 


As can be observed, DM performs slightly better than LDM, by reducing the 
delay of a barrier operation. To minimize this problem we use diffs instead of flushing the 
entire data block. 

Figure 41 describes the execution of the program below. The node assigned 
as Root corresponds to the barrier manager and should be generally the node on which the 


system is being initiated. 


Processor 1: Processor 2: 
C=C+1; D=D+1; 
Barrier (1); Barrier (1); 
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Discrete Time Line. 
Time Node 1 Root Node 2 





C=C+1; 
. S-bit block 2 == 0 then D=D+1; 
- Insert Node 1 on the block 3 
copyset. . Generate a SISEGY signal 
- Mark the block as dirty. Goel ie TNONED of all 
. Create the twin copy. ages is TNO 
‘es the protection to read- Fash into the he appropriate hom 
node ==> N 
Perform the write C=C +1. . Send a request block number 3 
to node 1. 
Request_Block (Block 3). 
- Insert Node 2 on the block 3 
copyset. 
; ey the block to the requesting 
node. 
Block (BLOCK3, BLOCK3_DIFF); 
- Mark the block as dirty. 


. Create the twin copy. 
. Change the protection to read- 
write. 

. Perform the write D = D +1. 


Barrier (1); 


. Update the Vector Timestamp; P 
1,0). Barrier (1); 


“Create the diffs to block 3. 
pda e nome node: . Create the diffs for Block 3. 
‘Cav 1 from Block 3 . Update the Vector Timestamp 


: Change the page protection to 
Protection None ( this action 
is aun betal to invalidating 


: insert fie diffs on the page. 


[0,1). 
- Send diffs and VC to the Hom 
node, 


- Invalidate local copy of block 3 


Prot KONE to 






. Send a barrier message to the 
barrier Manager. 


Barrier (Node 1, Barrier 1). 


-Update the counter for barrier 1. 
counter = 0 
. Send a Transpose Barrier 
“Message to all nodes. 
else 
Reply Wait_at_ Barrier. 
Wait at Bp 


Update_Block (3, VC 2, Diff). 


- VC 2 is concurrent to VC 1. 
If Bitmask and diff write into 
the same location Conner <0. 


Bai PED Send a Cross Barrier message 
to nodes 1 and 2. 
. Reset the Counter to 2. 


Barrier (Nofie 2, Barrier 1); 


“update Block 3. For this case 
D will be equal to the new 


value. Cross_Barrier (1). 


Cross_Barrier (9). 


Figure 41; Lazy Data Merging: barrier call. 
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V. EUREKA: A “LAZY DATA MERGING” IMPLEMENTATION 


In this chapter we describe the design of “Eureka”, a prototype DSM system that 
provides a software implementation of the LDM consistency model. Portability and 
efficiency are our major goals. For portability we use regular Unix BSD 4.3™ system calls 
(mmap, mprotect, etc., that are wrapped by C++ class definitions). For efficiency we should 
build the system on a multithreaded environment using signal handlers for detecting both 
page faults (detected by catching SIGSEGV signals) and received messages (detected by 
catching SIGIO/SIGURG signals). 

In Section A we list the major system components. Section B summarizes the 
system runtime environment, by exemplifying the interactions that should be undertaken 
by our DSM system. Section C outlines some implementation details by presenting extracts 
from the actual system source code. By doing so, we hope to clarify the complexity that is 


involved in building such systems. 


A. DSM SYSTEM ORGANIZATION 


This section describes the organization of the “Eureka” DSM system. Eureka is 


composed of two types of entities: objects and threads. The objects are responsible for 


managing a specific data structure and are considered as “reactive” entities!, that is: they 
respond to actions by updating their internal state and/or by providing replies to data 
requests. Objects can also act upon other objects on behalf of an initial thread request. 


Threads are “active” elements that make use of object services. 


1. Objects 


There are four major objects that should be active during the system lifetime: 


¢ Synchronization Directory: this object is responsible for maintaining the state 
of each individual synchronization mechanism and for providing the interfaces 
that allow a computing thread to perform the necessary synchronization 
operations. 


1. Reactive objects are objects that take actions upon the reception of an external stimulus. 
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¢ Page Directory: this object maintains the state of each page that is currently 
mapped into the local memory. This directory accumulates the roles of managing 
local and non-local pages. It provides interfaces for both the Synchronization 
Directory and to the DSM thread. 


¢ Suspend Queue: this object manages the insertion and removal of pending 
page requests. It provides interfaces that are accessed by the timer and by the 
DSM threads. 


¢ Process Table: this object is responsible for storing the addresses of all nodes 
that integrate the workstation cluster. The process table involves multiple sub- 
directories. The driven reason is that the system is implemented on a 
multithreaded environment and more than one thread may issue messages. 
Therefore, we need to maintain the state not only for remote processes but also 
for the local threads. 


The following subsections will describe in more detail the interfaces that should be 


provided for these objects. 


2. Local Threads 


There should be at least three active local threads at any time: 


¢ Computing Thread: this thread embodies the user application. For 
synchronization with the other remote computing threads we introduce 
synchronization primitives (locks and barriers). In practice, these 
synchronization operations require access to the methods provided by the 
“Synchronization Directory”. 


¢ DSM thread: this thread is responsible for managing local/remote block 
requests. It also manages the creation of “diffs” and the merging of update data 
that is received from other nodes. In summary, this thread processes all actions 
that are related to memory management. 


¢ Timer thread: the unique role of this thread is to periodically scan the suspend 
queue. If there are no processes waiting on the suspend queue, the timer thread 
Should yield the execution to another thread, otherwise the timer thread should 
scan the suspend queue and issue timeout messages which are on the queue 
longer than the specified threshold value if there are requests pending. 


There should be two distinct signal handlers: 


¢ Communication handler (SIGIO/SIGURG signals); and 
¢ Memory handler (handles SIGSEGV signals). 


The communication handler is responsible for handling all received messages. 


Based on the type of operation of the incoming message it may be delivered to one of the 
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local threads or forwarded to another node. The memory handler, in turn, is responsible for 
detecting page faults. The page faults can be caused by either page absence or by violating 
the page protection. In case of protection violation the memory handler actions should vary 
accordingly to the protocol that is currently in use. Page absences should be handled by 
issuing a request to the appropriate “home” node. 

These threads interact through global data structures and signals (SIGSEGV for 
page faults/protection violation and S/GJO for messages) that are issued during program 


execution. 


B. EUREKA RUNTIME ENVIRONMENT 


This section describes the various actions that are undertaken by each one of the 
distinct entities of the system. Figure 42 (Figure 25 on Chapter IV, section D) provides a 


pictorial description of the subject. 


: Local Blocks 


Processing 


Thread 


(GMC + LMC)| 





Figure 42: Eureka Runtime Environment. 
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The Eureka DSM system during its execution can be in one of four states: 


¢ System Initialization: when the threads, global objects and statically declared 
data variables are initialized; 


¢ System Execution: when the computing thread is forked; 
¢ Data Gathering:when the results are collected; 


¢ System Termination:when the global data structures are deallocated and the 
remote threads are terminated. 


Section 1 describes the overall system activities giving an abstract view of the 
system behavior during the four phases. The remaining sections will complement this 


initial introduction with more detailed aspects of Eureka at each particular phase. 


1. Eureka Execution Overview 


The system session should be started by the user from one of the nodes specified at 
the Erk.hosts file. The start-up routine will consist of the call to the macro Erk_ Start (argc, 
argv) from within the main routine which will, in turn, be responsible for the initialization 
of the DSM threads on the various remote nodes. All global variables within the system 
should be initialized within a function Erk Init ( ). This function will be called from the 
Erk Start ( ) routine. After the initialization of the system’s global variables the Master 
node will spawn the remaining DSM threads on each of the hosts defined on the Erk.hosts 
file through the use of the “rsh’’ system call. Figure 43 describes the system initialization. 

After all DSM threads have been created, the dispatcher node should initialize the 
globally shared objects and synchronization variables. The main routine will then 
synchronize all threads through the use of a barrier call, which has no data associated with 
it, through the call “Erk_BarrierWait (Num Nodes, NO_DATA)” . This is a requirement for 
initiating the dispatch of the computing thread on the remote nodes. Upon crossing the 
barrier, the dispatcher should initiate the process of forking the computing threads, 


initiating the execution phase. 
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Figure 43: Eureka System Initialization. 


The dynamic allocation/dealocation of shared memory as well as the management 
of barrier primitives are centralized in the “dispatcher” node. The dispatcher should also 
be designated as root for the lock variables. When all computing threads have terminated, 
the global memory should be brought to a consistent state. This will be achieved by a call 
to the barrier primitive “Erk_Converge” . This barrier call has the dual role of acting as a 
control primitive, by synchronizing all processes, and of enforcing global memory 
consistency, by flushing all dirty blocks to their corresponding home nodes. For blocks 
which are clean the node should send a “CLEAN” message to the corresponding home 
node. 

The next phase consists of the “Data Gathering” phase and is performed solely by 


the dispatcher. In it the system will present the final result through a GUI and/or by 
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redirecting the results to a file. The instrumentation phase can be performed concurrently 
with computation by performing multiple Erk Converge calls. Although important for 
real-time applications, for which intermediate states are as important as the final one the 
overlap of Data Gathering tasks with computation may result on a significant loss in 


performance. A sample main routine should have the following format: 


main (argc, argv) _// argv display the command line options 
{ 
// \nitiatize the Master and later on the Slaves remote threads. 
Erk_Start (argc, argv); 
// lf the master initialize the Shared Objects and 
//Synchronization primitives among the Servers. 
// Initialize Shared Memory and Synchronization Primitives. 
Erk_Init ( ); 
if (CHILD) 
return 1: 
if (MASTER) { 
// Fork remote computing threads on the other nodes 
Erk_Spawn_Child (Computing thread); 
Execute own local computing Thread; 
// Instrumentation phase: 
Display Results; 
// Shutdown Remote threads 
Erk_Shutdown ( ); 


To minimize contention we adopt a “distributed/fixed” approach for partitioning 
the shared data across the nodes. Therefore, each node is designated “home” for a set of 
data blocks and becomes responsible for its management. The expected performance gain 
by this approach was stated by Stumm in [SZ90]: 


“,...One potential problem with the central server is that it may become a 
bottleneck, since it has to service the requests from all clients. To distribute the 
server load the shared data can be distributed onto several servers. In that case, 
clients must be able to locate the correct server for data access . 


Poe ee eee 
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2. Handlers Initialization 


a. Communication Port and Communication Handler 


The local port, remote port, and message objects are the three objects that 
should be instantiated as soon as the Process Table is created. Their roles are to provide the 
appropriate “UDP” communication channels that are required for the exchange of 
messages between the system nodes. The port object provides means to send asynchronous 
messages to a thread on a remote node. The message object is responsible for providing the 
semantics (blocking and non-blocking) for sending and receiving messages and to maintain 
storage for received messages. The remote port object temporarily stores data that describes 
the source of a received message. 

Each node should have a well known port number which is determined by 
the local ProcessID. The ProcessID represents the order that the nodes were read from the 
file “Erk.hosts” . The Dispatcher node will be assigned a well known port number (between 
1024 and 5000) and all other ports should consist of adding the ProcessID to the initial port 
number. By doing this we allow more than one process on a single node. If it is known that 
no two processes will ever be assigned to the same node, then the port number can have an 
arbitrary number greater than 1024. 

Received messages are handled in an asynchronous way by defining the 
appropriate signal handler for S/G/O signals and modifying the local port attributes through 
the use of the “fcntl’”’ system call with the flags “FASYNC | FNDELAY” .. Similar to Quarks 
[CKK95], once a message is received it is inserted in the message list for the specific 
thread, in the process subdirectory. If the message is a synchronous message, the thread that 
is blocked should then be awakened and the corresponding actions (i.e. lock acquired, data 
block granted, etc.) should be performed. If it is an asynchronous message (page request, 
lock request, etc.), it should be handled by the “message handler” (i.e., forward of lock 
requests, etc.) or delivered to the DSM thread (i.e., page requests, lock requests, updates, 


etc.). 
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b. Memory Handler 


The major role of the memory handler is to detect violations to specified 
pages access rights. The following table describes the relation between the current page 


protection value and the actions the memory handler will perform in the event of a 


“SIGSEGV” signal. 


Table 3: Page Protection Actions 


PROT NONE 
appropriate home node. Change the page 


protection as soon as a page is received to 
PROT READ. 


PROT READ If the page is associated with a lock that | PROT READ / 
was acquired for reading an error condi- | PROT WRITE or 
tion is reached and the program should | generate an_ error 
abort. Otherwise the protection is altered | condition. 
to PROT_READ / PROT_WRITE and the 
appropriate actions should be taken to 
mark the page as dirty and the creation of 
its twin copy. 





It is also possible to generate an invalid address. To deal with this particular 
case the memory handler should verify if the corresponding address maps into a page that 
has been inserted in the page table. If the page number is not valid then an error condition 


should be raised. 


3. Eureka Shared Data and Synchronization Objects Allocation 


This section describes the actions that should be undertaken by the Eureka DSM 
system for object allocation. This subject is divided into static/dynamic memory allocation 


and creation of synchronization objects. 
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a. Static Memory Allocation 


The allocation of statically defined shared objects is performed on each 
individual node. This action should be carried out by the “Erk Init ()” routine. The 
initialization of shared data is, consequently, performed on a distributed fashion. An image 
of the shared virtual address space (VAS) is allocated for every process, however, each 
node is responsible for the integrity of one portion of that space. For these portions we name 
the node as “Home” for this pages. 

Static memory allocation is performed at a higher level by the call to the 
function Erk_ShMalloc (Var_name, sizeof (Object ID) * Number of objects). The 
underlyin g system will, in turn, be responsible for allocating the appropriate data structures 
that should maintain the state of each individual data block. 

The allocation of a shared data object involves requests to the Page 
Directory. The Page Directory’s role is to maintain the state of each individual page 
mapped on the Node. Within each node, shared pages can be divided in two groups: global 
pages and local pages. Global pages are the shared pages for which the local node is 
designated as “Home” node and, therefore, is responsible for their management (i.e. 
merging updates, monitoring the number of cached copies on remote nodes, allocation/ 
dealocation of bitmasks, storing the list of write-notices, controlling the S-bit, etc.). In a 
page fault the page should be mapped to the corresponding Home node and perform the 
request. On the other hand, local pages are acquired remotely from their corresponding 
Home nodes and temporarily cached at the local process. 

The actual allocation of the shared data will consist of two basic steps: 


¢ Mapping the object into memory; and 
¢ Allocation of the Data structures that will manage the shared blocks. 


The algorithm that describes this two steps can be summarized as below. 


Erk_ShMalloc (Any_T * ObjectiD, int ObjectSize) 


// Map the shared object into memory - by using mmap system call or 
// shared memory allocation. 
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ObjectID = PageTable.Map Object (ObjectSize). 


// With the initial address and object size allocate the data structures 

// that are needed for managing the pages. In accordance with the 

// data partitioning policy, identify the pages that should be marked as 

// “global” pages, and initialize the data structures that are required for 

// its management. 

PageTable.UpdatePageTable (ObjectlID, ObjectSize). 

The description of the method Map_ Object is described in section C. For the 

method UpdatePageTable we provide the following algorithm: 


UpdatePageTable (ObjectID, ObjectSize) 
{ 


// Verify the number of pages that the object requires. 
NumPages = ObjectSize / PageSize; 


// For each page verity if the page is global - the node is the “Home” 
// or if the page is local. For global pages it is necessary to allocate elements 
// for providing control of the copyset elements. 
for (| = 0 to NumPages -1) loop 
if (IsGlobal (PagelD + PageSize * |) ) { 
createControl (PagelD + PageSize * |, GLOBAL); 
} 


else 
createControl (PagelD + PageSize * 1, LOCAL); 


In reality, both types of page objects (global and local) are constructed in a 
similar way, but they behave differently. Global pages should allocate copyset lists and 
perform the coherence operations on updates that are received from the copyset elements. 

The major difference between this method and the one introduced by Munin 
and later on by Quarks is that in these two DSM systems the allocation and initialization of 
the shared data is performed ata single and predefined node. After the allocation, pages are 
transferred to remote nodes on demand. This approach presents a relatively high startup 
time, which can penalize short programs. 

In Eureka we propose that the UNIX™ “fork” semantics to be followed by 
performing the allocation of objects defined as shared in parallel. Therefore, when a remote 


DSM thread is forked the corresponding shared address space is also allocated. The 
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reduction on the startup time will depend of the data partitioning algorithm that is provided. 


In Section C we describe the implementation details for the above routines. 


b. Dynamic Memory Allocation 


For dynamic allocation of shared data we adopt a centralized approach. 
Therefore, the dispatcher node is assigned the controller for dynamic data allocation and 
deallocation. As in other implementations (Munin, Quarks, etc.) this problem is solved 
through the use of “RPC” (Remote Procedure Call) style calls to the dispatcher. It is 
required that the user provide the appropriate “stubs” for each method that is needed (e.g., 
allocation/deallocation, read and writing into the object). The operations that are 
performed on these shared objects (1.e., Queues, Lists, etc.), should be generally be 
mutually exclusive. For an appropriate result, the user might need to protect the critical 


section of its code with locks. 


C. Creation of Lock Objects 


Locks are managed in a distributed fashion, using the distributed queue 
algorithm suggested by Florin in [FBYR88]. The requests for lock creation and 
management should be performed within the “Erk Init’ routine. The global LocklId will 
correspond to the actual address of the synchronization object, that should be mapped in 
global memory by using the “mmap” system call. Within the node context of a node the 
LockID corresponds to its index on the Synchronization Directory. The Synchronization 
Directory corresponds to a table in which the locks are stored. The constructor for each lock 
object should include the initial lock owner (the Dispatcher) and the list of pages that are 
associated with the lock. The algorithm for lock allocation is described below. 

LockID SynchDirectory.CreateLock (ObjectID *ObjectinitAddr, int Offset, 
int Size) 
Static int i = 0; 
// \f ObjectAddr == NULL the lock is a control lock, therefore, no 


// data is associated with it. Otherwise, compute the pages that 
// need to be verified at the time of a lock release. 
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If ((ObjectAddr != NULL) a (Size > 0)) { 


// Based on the address build the list of pages that should be associated 
// with the lock. 

Page_List = Build_ListOfPages (ObjectinitAddr, Offset, Size); 

// Create the new lock having the Dispatcher node as root. 

Lock _List [i] = new Lock (Dispatcher, Page_List): 


else 
Lock_List [ i] = new Lock (Dispatcher, NULL); 
return Lock_List [ i++ ]->LockID ( ); 


} 
The Synchronization Directory is, in essence, a lock table that provides the 
appropriate mechanisms for allocating, deallocating, acquiring, and releasing of locks. 
Upon its creation it needs the information of the address from the Page Directory, so that it 


can request its services such as creation of diffs and update the pages’ Vector-Timestamp. 


d. Creation of Barrier Synchronization Objects 


The creation of barrier objects should be performed in two steps: 


¢ Associate the barrier object with the shared pages that should be updated 
at the time of a release. 
¢ Register with the barrier manager (Dispatcher node). 


Barrier calls can also be used solely for synchronization purposes. If that 
happens barriers will not update global memory. 

Barrier Converge primitives will require that each node traverse its 
corresponding Page Table and forward the diffs for pages that are dirty to the 
corresponding Home node, forcing global memory to enter in a consistent state. The 
algorithm below describes these actions. 


BarrierlD Erk_CreateBarrier (ObjectID *ObjectinitAddr, int Offset, 
int Size, int NumProcesses) 
{ 


Static int i = 0; 
// lf ObjectAddr == NULL the Barrier is a control lock, therefore, 
// no data is associated with it. Otherwise, compute the pages that 
// need to be verified at the time of a barrier_wait call. 
If ((ObjectAddr != NULL) && (Size > 0)) { 
Page_List = Build_ListOfPages (ObjectAddr, Offset, Size): 
Barrier_List [ i] = new Barrier (Dispatcher, Page_List, i): 
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Barrier_List [ i] = Page_addr (Page_Table_addr); 
} 


else 
Barrier_List [i ] = new Barrier (Dispatcher, NULL, i); 
return Barrier_List[i++]->RegisterBarrier(Thread!D, NumProcesses); 


} 


Although barriers are managed using a centralized approach, a regular barrier call 
is processed in two steps. The first one consists of sending updates (diffs) to the 
corresponding home nodes for pages that are dirty and CLEAN messages for those pages 
which are clean. Once the first step is performed the Barrier Primitive will call the barrier 


manager and wait until a transpose reply is received. 


4. Execution Phase 


In this section we describe the desired system behavior during the Execution Phase. 
In Eureka data and synchronization management are closely related. Recall that for 
correctness we rely on the appropriate use of synchronization primitives. 

After the initialization of global data structures (i.e. Process Table, Synchronization 
Directory, etc.) the computing thread is forked. At this point in time the system can be 
viewed as three threads running concurrently (timer, DSM and computing threads) and 
globally defined objects (Page Directory, Synchronization Directory, Process Table and 
Communication objects) upon which they should act. The next paragraphs describe how 
these threads and objects interact. We describe these interactions by listing the objects and 
how each thread relates to it. The algorithmic details are explained in Chapter IV (sub- 


subsections c, d, e, and f of section E, subsection 2). 


a. Suspend Queue 


Both the Timer and DSM threads will act over the Suspend Queue. The 
former by performing insertions and deletions of the IDs of processes which are waiting for 


an arbitrary data block and the latter by verifying if a timeout has occurred. 
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b. Page Directory 


The Page Directory object receives messages from the DSM and Timer 
threads and eventually from the memory signal handler (SISEGV signal handler). It also 
processes requests from the Synchronization Directory for creation of diffs. The DSM 
thread will interact with the Page Directory whenever a data request/update message is 
received. The actions for each one of these messages are described in Chapter IV. 

Whenever a page fault occurs, the memory handler should take the actions 
specified in Table 3. The signal handler should consult the Page Table in order to verify the 
protection attribute of the faulty page. Depending on the protection attribute, it may be 
needed to perform a page request to the home node. If the page is already in memory the 
actions can be reduced to modification of its protection. 

The interactions between the Timer thread and the Page Directory take place 
whenever a timeout occurs. At this instant, the Timer should verify the copyset of processes 
that currently cache a copy of that page and invalidate their pages. Once all updates have 
been received and appropriately merged, the Counter for that page will equal 0 and the 
Timer thread should be able to remove the ProcessID from the Suspend Queue and forward 


the requested page. 


c. Synchronization Directory and Page Table 


The computing thread, Synchronization Directory, and Page Table should 
interact whenever a synchronization operation is performed. At the time of an acquire_lock 
operation a request should be issued to the Current Lock Owner. Upon the receipt of an 
“acquire” message it will be followed by the set of write-notices as well as by the updates 
that were issued by the last lock owner. The received write-notices should be appended/ 
coalesced to the existing set and the diffs should be inserted into memory. The pages for 
which write-notices and no updates were received should be Invalidated (by setting their 


protection to “PROT NONE” ) and they should be marked as pages that are associated with 
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a lock by inserting the identity (by inspecting the write-notice tuple) of the node which has 
last performed a write operation into the page. 

At the time of a release operation the Synchronization Directory should inspect the 
lock queue for pending requests. If there are any, the Synchronization Directory should 
request that the Page Table the creation of diffs for pages marked as dirty (only the pages 
associated with the lock), and should also update the state of the write-notices for these 
pages and then forward this information with the lock granted message and the list of 
processes waiting for this lock. 

Barrier calls should behave in a similar manner, except that all diffs and write- 
notices that are associated with the barrier object are forwarded to the corresponding home 
node and all data that is associated with the barrier should be locally invalidated.This will 
ensure that values that are shared among multiple nodes will be updated after the barrier 


have been “crossed”’. 


d. Sending / Receiving Messages 


Messages are exchanged in Eureka for several different purposes and 
multiple threads can issue a message. Therefore, when sending a message, it is not enough 
to observe the IP address. We also need to specify to which remote thread this message 
Should be delivered. To fulfill this requirement some conventions were adopted. The first 
one is how to specify the Thread ID. The underlying threads package (Cthreads) provides 
a unique handle for each local thread. At the creation of any local thread this handle should 
be inserted into the Local Process Table. This will be useful for providing joining/detaches 
of threads. But what is really needed is a means for globally identifying each remote thread. 
This will be achieved by combining the Global Process ID with the thread number. The 
combination is performed as below: 


Thread!D = Global Process ID << 16 | LocalThreadID. 


Therefore, the LocalThreadID for the DSM thread running at the Dispatcher 


node will be the number 0, the Timer thread 1, and the computing thread 2. For the Process 
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1 the DSM thread will be the number 65536 and so on. For decoding the ThreadID we apply 
the reverse process. 


LocalThreadID = ThreadID & OOOOFFFF 


By doing this the system can appropriately identify the source and 
destination threads for each message. A Eureka’s typical message is composed by the fields 
described in Figure 44. The destination and source ThreadID are necessary to identify the 
threads within the context of a process. The Operation Code is used to identify the type of 
message that is currently being used. The Size field specifies the size of the data part. The 
Packet Number is necessary for messages that contain more than one datagram packet. The 


message header size is 20 bytes. 


4 bytes 2 bytes 2 —_ 


Destination Source oe Message Packet 
ThreadID ThreadID Code Family Number 


Figure 44: Message header format. 





One particular situation for the data field is the case in which the Operation 
Code consists of an Update message. For this case the message data field may have one of 
two formats. The first one consists of the diff bitmask and has a size of 128 bytes (Figure 
45). Each bit maps to a word within the page. A bit set to one represents the word is dirty 
and a “0” represents a clean word. 

At runtime, whenever a message arrives at a node a “SIGIO” signal is 
raised. The signal handler should then either post the message into the message list for the 
appropriate local thread (DSM thread for page requests, updates, etc., compute thread for 


Bypass-cache and synchronization messages), forward the message to another node (i.e., 
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lock requests) or even discard the message and reply with an “JNVALID” message, if there 


is no such thread on the node. 








Figure 45: Diff message. 


Whenever a thread sends a synchronous message or is waiting for a message 
to arrive (DSM thread), it should periodically peek on the message list. If the list is not 
empty it should handle the message, otherwise the given thread will yield the execution to 


another thread. The C++ code below describes this actions: 





Receive_Message (ThreadID) 


while (LocalProcessT able [ ThreadID ]->MessageList.IsEmpty ( ) ) 
cthread_yield (); 
return LocalProcessTable [ ThreadiD ]->MessageList.GetMessage ( ); 


} 


é. Operation Codes 


The operation codes are divided in four basic types: 


¢« Memory management requests; 

¢ RPC calls; 

¢ Synchronization messages; 

¢ System control messages. 

These operations are summarized in the following paragraphs. 


Memory Management Messages: 


¢ REQUEST BLOCK: this type of message is sent to the home node of 
the corresponding block. The message data field is inserted with the block 
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number. 


¢ UPDATE_DIFF: this message will consist of an update message as the 
result of a page flush or process termination. 


* WAIT_FOR_BLOCK: whenever a process poses a request and the 
Suspend-Bit for that page is set at the home node it should reply with a 
wait message such that a unit does not need to timeout and perform the 
request again. This action should force the requesting thread to block and 
wait until the reply arrives. The data field should contain the page number 
for which the request was performed. 


«BYPASS READ: this message should carry the address and size of the 
data to be read. 

¢ BYPASS READ_REPLY: this message carries the size and the actual 
data that has been read (the data is considered to be placed at a contiguous 
address). 


¢ BYPASS WRITE: the data field contains the initial address, the number 
of bytes to write and the actual data. When the operation is completed the 
Home node should reply with a “DATA_ WRITTEN” message. 


¢ DATA_WRITTEN: a reply from the Home node is issued when the 
bypass_write operation is completed (when the remote write has been 
performed). 


¢ INVALIDATE BLOCK: whenever the timeout value expires the Timer 
thread will issue a Flush_Block type message for the corresponding 
block. Only the LOCAL processes with the corresponding block dirty will 
reply with an Update_Block type message. The home node will, in turn, 
update the corresponding blocks, set the S-bit to one, and decrement the 
counter. This type of message has the same semantics of write invalidate 
when used in combination with the Locking messages. 


¢ CLEAN: the clean message is used to identify a page that is not dirty. Its 
purpose is to remove the nodeID from the page copyset at the home node. 


¢WRITE_NOTICE: a list of write notices. They consist of the number of 
write-notices plus the actual list of tuples “(BlockNumber, Vector 
Timestamp, Last Writer ID)”. 


RPC calls: 

As mentioned before, we use a centralized algorithm for managing dynamic 
memory. The dispatcher node should execute the stub function and return the reply to the 
requesting node. 


* RPC_CALL: the client node should pack the stub’s name and 
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arguments, if any, with this message. 


¢RPC_REPLY: consist of the reply form the server stub to the requested 
method. 


Synchronization Messages: 


¢ ACQUIRE_LOCK: lock request performed by the source of the message 
to the “probable” owner. The data field for this message will consist of 
the LockID, if the lock should be acquired for write or reading and the 
node’s current Vector Timestamp. 

¢ LOCK_GRANTED: when the lock request was granted. 


¢ LOCK_QUEUE: the list of nodes that are waiting for the lock. (Consist 
of NodeID and Vector Timestamp). 

¢ FORWARD LOCK REQUEST: this message is sent whenever the 
process is not the current lock owner. The lock request is forwarded to the 
next current owner. The arguments for this message are the data field 
from the original message as well as the NodeID of the requesting node. 
¢ WAIT FOR _LOCK: reply issued by the lock owner when the lock 
requested has been inserted on the lock queue. 

¢ WAIT AT BARRIER: a barrier call, performed by the remote nodes. 
The barrier manager should in turn decrement the counter and either reply 
with a TRANSPOSE _ BARRIER or BARRIER WAIT, depending if the 
counter value equals 0 or not. 


¢ BARRIER_WAIT: a reply from the barrier manager sent whenever the 
counter value > 0. 


¢ CROSS BARRIER: issued by the manager to all processes registered at 
the barrier waking then up. 

¢ CONDITION WAIT: call to a given condition variable. The callee 
should block until the condition becomes true. 


¢ WAIT_FOR_CONDITION: reply from the condition variable manager. 
It means that the condition is false. 


¢ CONDITION _SIGNAL: done by any thread informing that the condition 
is now true. 


¢ CONDITION_BROADCAST: wake-up all threads that are currently 
waiting at the given condition signal. 


System Control Messages: 


¢ ACK: issued whenever a reply is needed, but no action by the receiver 
is needed. 
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¢ FORK_THREAD: fork a new remote thread at the destination node. 
¢ SYSTEM _ SHUTDOWN: terminate all remote operations. 


¢ TERMINATED: the reply when the node is ready to terminate its 
threads. 


5. Data Gathering Phase 

This phase is initiated after a call to a barrier converge primitive by all Eureka 
nodes. The result of this call is that the global memory should enter in a consistent state. 

Upon reception of a “CROSS BARRIER” call the dispatcher node should start 
collecting the necessary data. This task is performed by the underlying system, becoming 
transparent to the user. The user is responsible for specifying the procedure that should be 


run at the dispatcher node after the barrier_converge call. 


6. Termination Phase 

The termination phase is initiated by the dispatcher after the Data_Collection phase 
and consists of a “SYSTEM_SHUTDOWN” message sent to all nodes, issued by the 
Dispatcher node. Upon receipt of this message, each node should graciously terminate all 
threads, deallocate its global objects and unmap the shared global memory. Finally, each 
node replies with a TERMINATED message. When all nodes have terminated the 


dispatcher will be ready to finish. 


C. CODE EXAMPLES 


This section describes the actual details for implementing the DSM system. 


1. Creation of an UDP Port 


The constructor for a local UDP Port object should be initialized with the host name 


and the corresponding port number. 


UDP_Port::UDP_Port (char “hostName, unsigned short Port) 


{ 
localAddr.sin_family = AF_INET; 





r 
* Open an UDP socket. 
ot 
if ( (sockfd = socket (localAddr.sin_family, 
SOCK_DGRAM, 0) ) <0) { 
printf ("Value of sin_family = %d \n", localAddr.sin_family ); 
perror ( "Cannot allocate socket") ; 


/* 

* Now bind our local address so that the other processes 
* can find us. 

i 

bzero ((char *) &localAddr, sizeof (localAddr) ): 
localAddr.sin_addr.s_addr = INADDR_ANY; 
localAddr.sin_port = Port; 


if ( bind (socktd, (struct sockaddr *) &localAddr, sizeof (localAddr)) < 0) { 
printf ( "Value of sockfd = %d\n", sockfd); 
perror ( "Cannot bind socket" ); 


#ifdef DEBUG 


printf ("Done with initialization of the UDP socket \n."); 
#endif 


} 








To receive a message the node should perform a call to the method rcvMsg. This 


call will be performed within the signal Handler for the SIGIO system call. 
ia 
“ Receive a message from a remote node and returns the sender info plus a 
“ pointer to the buffer and the size of the message just received. 
of 
int 
UDP_Port::rcvMsg (RemotePort *From, char *msg, int maxLgth) 
{ 
ints = 0; 
int revDatagramSize = 0; 


i 

“If From is not a NULL pointer, the source address of the 
“message is filledin. S is a value-result parameter, 

* initialized to the size of the buffer associated with From, 

* and modified on return to indicate the actual size of the 
“address stored there. The length of the message is 
“returned. If a message is too long to fit in the supplied 

* buffer, excess bytes may be discarded depending on the type 
“ of socket the message is received from. 








*/ 
S = sizeof (From->remoteAddr); 


if ( (rcvDatagramSize = recvfrom (sockfd, msg, maxLgth, 0, 
(struct sockaddr *) &(From->remoteAddr), &s) ) < 0) 
{ 
perror("Error in Recvfrom"); 
if (errno != EWOULDBLOCk) 
{ 


perror("Error in Recvtrom"); 
} 
else 
rcvDatagramSize = 0; 
} 


return rcvDatagramSize; 


} 


Asynchronous I/O 


ia 
* 1- Now we need to set the process ID to receive the SIGIO or SIGURG 
* signals for the socket associated with fd.This is done with the 
“command F_SETOWN. (S_SETOWN > 0 -> process ID and 
i S_SETOWN < 0 -> process Group ID ). 
* 2- Later on we need to set a flag for FASYNC (Signal process group when 
* ready) and FNDELAY (Nonbiocking I/O). This is done with the command F_SETEFL. 
si 
if (fnctl (sockfd,F_SETOWN, getpid ( )) < 0) 
perror (“UNIX PORT F_SETOWN error"); 


if (fnctl (sockfd, F_SETFL, FNDELAY | FASYNC ) < 0) 
perror ("UNIX PORT F_SETFL error’); 


ial 


2: DSM System Calls 


Create a single logical address space: 


[* 
* A zero special file is a source of zeroed unnamed memory. This file is 
* of infinite length. Mapping a zero special file creates a zero-initialized 
* unnamed memory object of a length equal to the length of the mapping rounded 
* up to the nearest page size as returned by getpagesize. Multiple processes 
“canshare such a zero special file object provided a common ancestor mapped 
* the object MAP_SHARED. 
| 
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if ((mapFD = open ("/dev/zero", O_RDWR, umask(0))) < 0) 
printf ("Couldn't open the desired FD for mmap\n"); 


else { 
if ((virtualBaseAddr= mmap (0, regionSize, 
PROT_NONE, MAP_PRIVATE, mapFD, 0)) <0) { 
perror ("Cannot Map the correct value"); 
printf ("Couldn't allocate memory \n"): 
} 
else { 
matrixA = (my_type *) virtualBaseAdar: 
printf (" address of matrixA = %lu \naddress of virtualAddr = %lu", 
matrixA, (long unsigned) virtualBaseAddr); 


} 





To modify its protection: 


void 
segv_handler (int sig, int code, struct sigcontext “context, char *addr) 


{ 


int pageSize = getpagesize ( ); 
int pageAlignedAdar; 








spihigh (); // maximize the thread priority. 
sigblock (sigsetmask (SIGSEGV));  // block SIGSEGV signals. 


printf ("\n******* PAGE FAULT on Virtual address = %lu \n\n", 
(unsigned long) addr); 
printf ("PAGE SIZE = %d \n", pageSize): 


ie 

“ Page align the faulty address otherwise mprotect will refuse 
“to change protection. 

ef 


if (( pageAlignedAddr = (unsigned long) addr % pageSize) != 0) 
pageAlignedAddr = (unsigned long) addr - pageAlignedAdar; 
else 
pageAlignedAddr =(unsigned long) addr; 


ig 
“ Now change the protection of the the given address so that | can write 
* into it. 
*/ 
if ( mprotect ( (char “)pageAlignedAddr, pageSize, 
PROT_READ | PROT_WRITE) < 0) 
printf ("Setting an invalid address \n"); 


sigsetmask (0); // unblock ail signals. 








spllow ( ): // reduce the thread priority. 


3. How to Present Debug Information 
To be able to monitor a program running on multiple nodes we allow the display of 


multiple windows at the dispatcher node through the use of the rsh and xterm system calls. 


sprintf(forrsh, " %S%S%SRVS%VSVSRVRAVSHS*SVSVSVs%s", 
"rsh", hostAddress , "/usr/bin/X1 1/xterm -display ", masterAddress ":0.0", 
"-title Pid_", i"-", hostAddress, geom_string, " -e ", path, 
"/matmult /users/work4/tavares/THESYS/WORK/result", 
" /users/work4/tavares/THESYS/WORK/Erk.hosts &"); 
#ifdef DEBUG 
printf("Rsh command: <%s>\n", forrsh); 
#endif 


/* Now execute the remote command */ 
system (forrsh); 
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VI. CONCLUSION 


The introduction of fast networks (e.g. ATM standard) made both message passing and 
shared memory systems feasible alternatives for solving computationally intensive 
problems at a very low cost. They allow combining clusters of workstations into a single 
abstraction. The major advantage of shared memory systems over their message passing 
counterparts is that such systems relieve the programmer from the burden of worrying 
about data movement, which for some applications can become a very complex task. 

In this thesis we have performed a comprehensive description and analysis of existing 
memory consistency models and DSM systems using representative examples of each 
category (Chapters I and ID). 

Based on studying of existing DSM consistency models and their implementations, we 
modified Data Merging to obtain a new protocol, “Lazy Data Merging”, which 
incorporates features from both Lazy Release Consistency and Entry Consistency memory 
models. 

The analysis of multiple DSM systems implementations were particularly important for 
the design of Eureka, a DSM system that implements the Lazy Data Merging consistency 
model. To ensure portability we use standard Unix™ system calls (i.e. mprotect, mmap, 
etc.). Our expectations are that as is the case of PVM and MPI, a portable implementation 
of a DSM system should contribute for disseminating their use among the scientific 
community. In Chapter V we provide the indications for this path. 

Eureka is a partial implementation of the LDM consistency model. Our preliminary 
results corroborate the protocol correctness and possibilities to provide performance 
enhacements. Quarks [CKK95] is also a portable DSM system developed at the University 
of Utah, and currently provides an implementation of the Eager Release Consistency 
Model. In order to accelerate the implementation of Eureka we have reused some of 
Quark’s basic components. Our initial performance measurements provide results that are 


comparable to the original system, but our expectations are that, when fully implemented, 
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Eureka would present superior results since it implements a more relaxed memory 


consistency model. 


A. SUGGESTIONS FOR FUTURE WORK 


As mentioned before, Eureka partially implements the LDM protocol and the 
conclusion of the implementation of synchronization primitives is needed. Further research 
is also required for instrumenting the system and collecting an appropriate set of data (i.e., 
number of page faults, number of diffs for each page and average number of diffs per page, 
number of synchronization accesses, start-up time, average size and number of messages, 
etc.). Based on these statistics some improvements can be achieved, since we have 
designed the protocol adopting a conservative approach. 

Other open questions are the consideration of the price paid for implementing a 
portable solution when compared to system-oriented ones and the benefits of adaptive 


versus static data mapping policies. 
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